Re: [PATCH] KVM: PPC: Book3S PR: Do not fail emulation with mtspr/mfspr for unknown SPRs

2017-04-03 Thread Paul Mackerras
On Mon, Apr 03, 2017 at 01:23:15PM +0200, Thomas Huth wrote:
> According to the PowerISA 2.07, mtspr and mfspr should not generate
> an illegal instruction exception when being used with an undefined SPR,
> but rather treat the instruction as a NOP, inject a privilege exception
> or an emulation assistance exception - depending on the SPR number.

The emulation assist interrupt is a hypervisor interrupt, so the guest
would not be expecting to receive it.  On a real machine, the
hypervisor would synthesize an illegal instruction type program
interrupt as described in the last programming note in section 6.5.9
of Book III of Power ISA v2.07B.  Since we are the hypervisor here, we
should synthesize a program interrupt rather than an emulation assist
interrupt.
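
For illustration, the decision described above can be sketched as a small
standalone function. This is a hypothetical sketch, not the actual KVM
code; the names are invented. The one hard fact it relies on is the ISA's
SPR privilege encoding: in the assembled 10-bit SPR number, the bit with
value 0x10 marks the SPR as privileged.

```c
#include <assert.h>
#include <stdbool.h>

enum spr_fault {
	SPR_INJECT_PRIVILEGE,	/* privilege-type program interrupt */
	SPR_INJECT_ILLEGAL,	/* illegal-instruction-type program interrupt */
};

/* Bit 0x10 of the assembled SPR number marks a privileged SPR. */
static bool spr_is_privileged(unsigned int sprn)
{
	return (sprn & 0x10) != 0;
}

/* Hypothetical classifier for mtspr/mfspr on an SPR we don't emulate. */
static enum spr_fault classify_unknown_spr(unsigned int sprn, bool problem_state)
{
	if (spr_is_privileged(sprn) && problem_state)
		return SPR_INJECT_PRIVILEGE;
	/* Since we are the hypervisor here, synthesize an illegal-
	 * instruction program interrupt, not an emulation assist
	 * interrupt the guest would never expect. */
	return SPR_INJECT_ILLEGAL;
}
```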

> Also turn the printk here into a ratelimited print statement, so that
> the guest cannot flood the dmesg log of the host by issuing lots of
> illegal mtspr/mfspr instructions.

Good idea.

Paul.


Re: [PATCH] KVM: PPC: Book3S PR: Do not always inject facility unavailable exceptions

2017-04-03 Thread Paul Mackerras
On Mon, Apr 03, 2017 at 01:28:34PM +0200, Thomas Huth wrote:
> KVM should not inject a facility unavailable exception into the guest
> when it tries to execute a mtspr/mfspr instruction for an SPR that
> is unavailable, and the vCPU is *not* running in PRoblem state.
> 
> It's right that we inject an exception when the vCPU is in PR mode, since
> chapter "6.2.10 Facility Status and Control Register" of the PowerISA
> v2.07 says that "When the FSCR makes a facility unavailable, attempted
> usage of the facility in *problem state* is treated as follows: [...]
> Access of an SPR using mfspr/mtspr causes a Facility Unavailable
> interrupt". But if the guest vCPU is not in PR mode, we should follow
> the behavior that is described in chapter "4.4.4 Move To/From System
> Register Instructions" instead and treat the instruction as a NOP.

This doesn't seem quite right.  My reading of the ISA is that the FSCR
bit for a facility being 0 doesn't prevent privileged code from
accessing the facility, so we shouldn't be treating mfspr/mtspr as
NOP.  Instead we should set the facility's bit in the shadow
FSCR and re-execute the instruction (remembering of course to clear
the FSCR bit when we go back to emulated problem state).

For TM it's a bit different as the MSR[TM] bit does prevent privileged
code from accessing TM registers and instructions, so for TM we should
be delivering a facility unavailable interrupt even when the guest is
in emulated privileged state.

So I don't see any case where mfspr/mtspr should be treated as a NOP
in response to a facility unavailable interrupt.
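
The policy argued for above reduces to a small decision table; here is a
hedged sketch with invented names (not the KVM implementation). It encodes
the two facts from the discussion: FSCR only gates problem-state access,
while MSR[TM]=0 blocks even privileged code, so TM always takes the
interrupt.

```c
#include <assert.h>
#include <stdbool.h>

enum fu_action {
	FU_GRANT_AND_RETRY,	/* set bit in shadow FSCR, re-execute */
	FU_DELIVER_INTERRUPT,	/* deliver facility unavailable interrupt */
};

/* Hypothetical handler for a facility unavailable intercept. */
static enum fu_action handle_fac_unavail(bool problem_state, bool facility_is_tm)
{
	/* MSR[TM]=0 blocks even privileged code, so TM always faults. */
	if (facility_is_tm)
		return FU_DELIVER_INTERRUPT;
	/* FSCR only gates problem state; privileged code gets the
	 * facility granted in the shadow FSCR and the instruction
	 * re-executed. */
	if (problem_state)
		return FU_DELIVER_INTERRUPT;
	return FU_GRANT_AND_RETRY;
}
```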

Paul.


[PATCH] powerpc/hugetlb: Add ABI defines for MAP_HUGE_16MB and MAP_HUGE_16GB

2017-04-03 Thread Anshuman Khandual
This just adds user space exported ABI definitions for both 16MB and
16GB non-default huge page sizes to be used with the mmap() system call.

Signed-off-by: Anshuman Khandual 
---
These defined values will be used along with MAP_HUGETLB while calling
mmap() system call if the desired HugeTLB page size is not the default
one. Follows similar definitions present in x86.

arch/x86/include/uapi/asm/mman.h:#define MAP_HUGE_2MB	(21 << MAP_HUGE_SHIFT)
arch/x86/include/uapi/asm/mman.h:#define MAP_HUGE_1GB	(30 << MAP_HUGE_SHIFT)

 arch/powerpc/include/uapi/asm/mman.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/mman.h 
b/arch/powerpc/include/uapi/asm/mman.h
index 03c06ba..e78980b 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -29,4 +29,7 @@
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
 
+#define MAP_HUGE_16MB  (24 << MAP_HUGE_SHIFT)  /* 16MB HugeTLB Page */
+#define MAP_HUGE_16GB  (34 << MAP_HUGE_SHIFT)  /* 16GB HugeTLB Page */
+
 #endif /* _UAPI_ASM_POWERPC_MMAN_H */
-- 
1.8.5.2
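
The MAP_HUGE_* values encode log2 of the page size in the bits above
MAP_HUGE_SHIFT, which is 26 on Linux: 16MB = 2^24 gives 24, 16GB = 2^34
gives 34. A small user-space sketch of that encoding (demo macro names,
not the uapi headers):

```c
#include <assert.h>

/* The flag carries log2(page size) shifted up by MAP_HUGE_SHIFT (26). */
#define DEMO_MAP_HUGE_SHIFT	26

static unsigned long long map_huge_flag(unsigned int log2_pagesize)
{
	return (unsigned long long)log2_pagesize << DEMO_MAP_HUGE_SHIFT;
}

/* Typical use (needs reserved 16MB hugepages; error handling omitted):
 *
 *	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
 *		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_16MB,
 *		 -1, 0);
 */
```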



Re: [7/7] crypto: caam/qi - add ablkcipher and authenc algorithms

2017-04-03 Thread Michael Ellerman
Horia Geantă  writes:

> Add support to submit ablkcipher and authenc algorithms
> via the QI backend:
> -ablkcipher:
> cbc({aes,des,des3_ede})
> ctr(aes), rfc3686(ctr(aes))
> xts(aes)
> -authenc:
> authenc(hmac(md5),cbc({aes,des,des3_ede}))
> authenc(hmac(sha*),cbc({aes,des,des3_ede}))
>
> caam/qi being a new driver, let's wait some time to settle down without
> interfering with existing caam/jr driver.
> Accordingly, for now all caam/qi algorithms (caamalg_qi module) are
> marked to be of lower priority than caam/jr ones (caamalg module).
>
> Signed-off-by: Vakul Garg 
> Signed-off-by: Alex Porosanu 
> Signed-off-by: Horia Geantă 
> ---
>  drivers/crypto/caam/Kconfig|   20 +-
>  drivers/crypto/caam/Makefile   |1 +
>  drivers/crypto/caam/caamalg.c  |9 +-
>  drivers/crypto/caam/caamalg_desc.c |   77 +-
>  drivers/crypto/caam/caamalg_desc.h |   15 +-
>  drivers/crypto/caam/caamalg_qi.c   | 2387 
> 
>  drivers/crypto/caam/sg_sw_qm.h |  108 ++
>  7 files changed, 2601 insertions(+), 16 deletions(-)
>  create mode 100644 drivers/crypto/caam/caamalg_qi.c
>  create mode 100644 drivers/crypto/caam/sg_sw_qm.h


This appears to be blowing up my Freescale (NXP) P5020DS board:

  Unable to handle kernel paging request for data at address 0x0020
  Faulting instruction address: 0xc04393e4
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=24 
  CoreNet Generic
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc3-compiler_gcc-4.6.3-00046-gb189817cf789 #5
  task: c000f70c task.stack: c000f70c8000
  NIP: c04393e4 LR: c04aeba0 CTR: c04fa7d8
  REGS: c000f70cb160 TRAP: 0300   Not tainted  (4.11.0-rc3-compiler_gcc-4.6.3-00046-gb189817cf789)
  MSR: 80029000  CR: 24adbe48  XER: 2000
  DEAR: 0020 ESR:  SOFTE: 1 
  GPR00: c06feba0 c000f70cb3e0 c0e6  
  GPR04: 0001  c0e0b290 0003 
  GPR08: 0004 c0ea5280 0004 0004 
  GPR12: 88adbe22 c0003fff5000 c0ba3518 880088090fa8 
  GPR16: 1000 c0ba3500 c000f72c68d8 0004 
  GPR20: c0ea5280 c0ba34e8 0020 0004 
  GPR24: c0eab7c0  c000f7fc8818 c0eb 
  GPR28: c000f786cc00 c0eab780 f786cc00 c0eab7c0 
  NIP [c04393e4] .gen_pool_alloc+0x0/0xc
  LR [c04aeba0] .qman_alloc_cgrid_range+0x24/0x54
  Call Trace:
  [c000f70cb3e0] [c0504054] .platform_device_register_full+0x12c/0x150 (unreliable)
  [c000f70cb460] [c06feba0] .caam_qi_init+0x158/0x63c
  [c000f70cb5f0] [c06fc64c] .caam_probe+0x8b8/0x1830
  [c000f70cb740] [c0503288] .platform_drv_probe+0x60/0xac
  [c000f70cb7c0] [c0501194] .driver_probe_device+0x248/0x344
  [c000f70cb870] [c05013b4] .__driver_attach+0x124/0x128
  [c000f70cb900] [c04fed90] .bus_for_each_dev+0x80/0xcc
  [c000f70cb9a0] [c0500858] .driver_attach+0x24/0x38
  [c000f70cba10] [c050043c] .bus_add_driver+0x14c/0x29c
  [c000f70cbab0] [c0501d64] .driver_register+0x8c/0x154
  [c000f70cbb30] [c0503000] .__platform_driver_register+0x48/0x5c
  [c000f70cbba0] [c0c7f798] .caam_driver_init+0x1c/0x30
  [c000f70cbc10] [c0001904] .do_one_initcall+0x60/0x1a8
  [c000f70cbcf0] [c0c35f30] .kernel_init_freeable+0x248/0x334
  [c000f70cbdb0] [c00020fc] .kernel_init+0x1c/0xf20
  [c000f70cbe30] [c9bc] .ret_from_kernel_thread+0x58/0x9c
  Instruction dump:
  eb61ffd8 eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3860 4bb0 
  7ce53b78 4b0c 7f67db78 4b24  e8c30028 4bfffd30 fbe1fff8 
  ---[ end trace 9f61087068991b02 ]---


home:linux-next(4)(I)> git bisect log
...
git bisect bad b189817cf7894e03fd3700acd923221d3007259e
# first bad commit: [b189817cf7894e03fd3700acd923221d3007259e] crypto: caam/qi - add ablkcipher and authenc algorithms


The oops is saying gen_pool_alloc() was called with a NULL pointer, so
it seems qm_cgralloc is NULL:

static int qman_alloc_range(struct gen_pool *p, u32 *result, u32 cnt)
{
unsigned long addr;

addr = gen_pool_alloc(p, cnt);
...

int qman_alloc_cgrid_range(u32 *result, u32 count)
{
return qman_alloc_range(qm_cgralloc, result, count);
}


I didn't pull the thread any further than that.

cheers


[PATCH] powerpc: Don't try to fix up misaligned load-with-reservation instructions

2017-04-03 Thread Paul Mackerras
In the past, there was only one load-with-reservation instruction,
lwarx, and if a program attempted a lwarx on a misaligned address, it
would take an alignment interrupt and the kernel handler would emulate
it as though it were lwzx.  That was not really correct, but benign,
since it loads the right amount of data, and the lwarx should be
paired with a stwcx. to the same address, which would also cause an
alignment interrupt, resulting in a SIGBUS being delivered to the
process.

We now have 5 different sizes of load-with-reservation instruction.
Of those, lharx and ldarx cause an immediate SIGBUS by luck since
their entries in aligninfo[] overlap instructions which were not
fixed up, but lqarx overlaps with lhz and will be emulated as such.
lbarx can never generate an alignment interrupt since it only
operates on 1 byte.

To straighten this out and fix the lqarx case, this adds code to
detect the l[hwdq]arx instructions and return without fixing them
up, resulting in a SIGBUS being delivered to the process.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kernel/align.c | 27 +++
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index cbc7c42cdb74..ec7a8b099dd9 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -807,14 +807,25 @@ int fix_alignment(struct pt_regs *regs)
nb = aligninfo[instr].len;
flags = aligninfo[instr].flags;
 
-   /* ldbrx/stdbrx overlap lfs/stfs in the DSISR unfortunately */
-   if (IS_XFORM(instruction) && ((instruction >> 1) & 0x3ff) == 532) {
-   nb = 8;
-   flags = LD+SW;
-   } else if (IS_XFORM(instruction) &&
-  ((instruction >> 1) & 0x3ff) == 660) {
-   nb = 8;
-   flags = ST+SW;
+   /*
+* Handle some cases which give overlaps in the DSISR values.
+*/
+   if (IS_XFORM(instruction)) {
+   switch (get_xop(instruction)) {
+   case 532:   /* ldbrx */
+   nb = 8;
+   flags = LD+SW;
+   break;
+   case 660:   /* stdbrx */
+   nb = 8;
+   flags = ST+SW;
+   break;
+   case 20:/* lwarx */
+   case 84:/* ldarx */
+   case 116:   /* lharx */
+   case 276:   /* lqarx */
+   return 0;   /* not emulated ever */
+   }
}
 
/* Byteswap little endian loads and stores */
-- 
2.11.0
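
The detection the patch adds keys off the X-form extended opcode, which
sits in instruction bits 21..30. A standalone sketch of that decode
(mirroring, not reproducing, the kernel's get_xop(); the lbarx opcode 52
is included for completeness even though it never takes an alignment
interrupt):

```c
#include <assert.h>
#include <stdbool.h>

/* X-form PowerPC instructions carry a 10-bit extended opcode in
 * bits 21..30, i.e. (instr >> 1) & 0x3ff. */
static unsigned int xop(unsigned int instr)
{
	return (instr >> 1) & 0x3ff;
}

/* Extended opcodes of the load-with-reservation family. */
static bool is_larx(unsigned int instr)
{
	switch (xop(instr)) {
	case 20:	/* lwarx */
	case 52:	/* lbarx */
	case 84:	/* ldarx */
	case 116:	/* lharx */
	case 276:	/* lqarx */
		return true;
	}
	return false;
}
```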



Re: [PATCH v6 06/11] powerpc/perf: IMC pmu cpumask and cpu hotplug support

2017-04-03 Thread Daniel Axtens
Madhavan Srinivasan  writes:

> From: Hemant Kumar 
>
> Adds cpumask attribute to be used by each IMC pmu. Only one cpu (any
> online CPU) from each chip for nest PMUs is designated to read counters.
>
> On CPU hotplug, dying CPU is checked to see whether it is one of the
> designated cpus, if yes, next online cpu from the same chip (for nest
> units) is designated as new cpu to read counters. For this purpose, we
> introduce a new state : CPUHP_AP_PERF_POWERPC_NEST_ONLINE.
>
> Signed-off-by: Anju T Sudhakar 
> Signed-off-by: Hemant Kumar 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/include/asm/opal-api.h|  13 ++-
>  arch/powerpc/include/asm/opal.h|   3 +
>  arch/powerpc/perf/imc-pmu.c| 155 
> -
>  arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
>  include/linux/cpuhotplug.h |   1 +
>  5 files changed, 171 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index a0aa285869b5..23fc51e9d71d 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -168,7 +168,8 @@
>  #define OPAL_INT_SET_MFRR125
>  #define OPAL_PCI_TCE_KILL126
>  #define OPAL_NMMU_SET_PTCR   127
> -#define OPAL_LAST127
> +#define OPAL_NEST_IMC_COUNTERS_CONTROL   149

A couple of things:

 - why is this 149 rather than 128?

 - CONTROL seems like it's inviting a very broad and under-specified
   API. I notice most of the other opal calls are reasonably narrow:
   often defining two calls for get/set rather than a single control
   call.

I don't have cycles to review the OPAL/skiboot side and I'm very much
open to being convinced here, I just want to avoid the situation where
we're passing around complicated data structures in a way that is
difficult to synchronise if we could avoid it.

> +#define OPAL_LAST149
>  
>  /* Device tree flags */
>  
> @@ -928,6 +929,16 @@ enum {
>   OPAL_PCI_TCE_KILL_ALL,
>  };
>  
> +/* Operation argument to OPAL_NEST_IMC_COUNTERS_CONTROL */
> +enum {
> + OPAL_NEST_IMC_PRODUCTION_MODE = 1,
> +};
If there's only one mode, why do we need to specify it? As far as I can
tell no extra mode is added later in the series...

... looking at the opal-side patches, would it be better to just specify
the modes you intend to support in future here, and rely on opal
returning OPAL_PARAMETER when they're not supported?
> +
> +enum {
> + OPAL_NEST_IMC_STOP,
> + OPAL_NEST_IMC_START,
> +};
Again, would it be better to have a stop/start call rather than a
generic control call? 

Also, the enum values (STOP=0, START=1) seem to be switched vs OPAL,
where https://patchwork.ozlabs.org/patch/746292/ has STOP=1, START=0. Or
am I missing something?

> +
>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* __OPAL_API_H */
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 1ff03a6da76e..d93d08204243 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -227,6 +227,9 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t 
> kill_type,
> uint64_t dma_addr, uint32_t npages);
>  int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);
>  
> +int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
> + uint64_t value2, uint64_t value3);
> +

This prototype does not make me feel better about the state of the API.

Looking at the opal side, I would be much more comfortable if you could
nail down what you intend to have these support now, even if OPAL bails
with OPAL_PARAMETER if they're specified.

Also I think u64 and u32 are preferred, although I see the rest of the
file uses the long form.

>  /* Internal functions */
>  extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
>  int depth, void *data);
> diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
> index bd5bf66bd920..b28835b861b3 100644
> --- a/arch/powerpc/perf/imc-pmu.c
> +++ b/arch/powerpc/perf/imc-pmu.c
> @@ -17,6 +17,7 @@
>  
>  struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
>  struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
> +static cpumask_t nest_imc_cpumask;
>  
>  /* Needed for sanity check */
>  extern u64 nest_max_offset;
> @@ -32,6 +33,152 @@ static struct attribute_group imc_format_group = {
>   .attrs = imc_format_attrs,
>  };
>  
> +/* Get the cpumask printed to a buffer "buf" */
> +static ssize_t imc_pmu_cpumask_get_attr(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + cpumask_t *active_mask;
> +
> + active_mask = &nest_imc_cpumask;
> + return cpumap_print_to_pagebuf(true, buf, active_mask);
> +}
> +
> +static DEVICE_

Re: [PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority

2017-04-03 Thread Vaidyanathan Srinivasan
* Nicholas Piggin  [2017-04-04 09:52:07]:

> On Tue,  4 Apr 2017 07:54:12 +1000
> Anton Blanchard  wrote:
> 
> > From: Anton Blanchard 
> > 
> > The core of snooze_loop() continually bounces between low and very
> > low thread priority. Changing thread priorities is an expensive
> > operation that can negatively impact other threads on a core.
> > 
> > All CPUs that can run PowerNV support very low priority, so we can
> > avoid the change completely.
> 
> This looks good. I have HMT_lowest() which does alt feature patching
> we can use for pseries and default idle code.

Alternatively, if we are going to set priority only once in various
other places, HMT_low(); HMT_very_low(); should not add to extra
cycles.  Let me code that up.

--Vaidy

 



Re: [RFC][PATCH] spin loop arch primitives for busy waiting

2017-04-03 Thread Nicholas Piggin
On Tue, 4 Apr 2017 13:02:33 +1000
Nicholas Piggin  wrote:

> On Mon, 3 Apr 2017 17:43:05 -0700
> Linus Torvalds  wrote:
> 

> > But that depends on architectures having some pattern that we *can*
> > abstract. Would some "begin/in-loop/end" pattern like the above be
> > sufficient?  
> 
> Yes. begin/in/end would be sufficient for powerpc SMT priority, and
> for x86, and it looks like sparc64 too. So we could do that if you
> prefer.

How's this? I changed your name a bit just so we have a common spin_
prefix. With example powerpc implementation and one caller converted
to see the effect.

---
 arch/powerpc/include/asm/processor.h | 17 +
 include/linux/processor.h| 48 
 kernel/sched/idle.c  |  7 +-
 3 files changed, 71 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/processor.h

diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index e9bbd450d966..1274dc818e74 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -402,6 +402,23 @@ static inline unsigned long __pack_fe01(unsigned int 
fpmode)
 
 #ifdef CONFIG_PPC64
 #define cpu_relax()do { HMT_low(); HMT_medium(); barrier(); } while (0)
+
+#ifndef spin_begin
+#define spin_begin()   HMT_low()
+#endif
+
+#ifndef spin_cpu_relax
+#define spin_cpu_relax()   barrier()
+#endif
+
+#ifndef spin_cpu_yield
+#define spin_cpu_yield()
+#endif
+
+#ifndef spin_end
+#define spin_end() HMT_medium()
+#endif
+
 #else
 #define cpu_relax()barrier()
 #endif
diff --git a/include/linux/processor.h b/include/linux/processor.h
new file mode 100644
index ..65e5635d0069
--- /dev/null
+++ b/include/linux/processor.h
@@ -0,0 +1,48 @@
+/* Misc low level processor primitives */
+#ifndef _LINUX_PROCESSOR_H
+#define _LINUX_PROCESSOR_H
+
+#include 
+
+/*
+ * spin_begin is used before beginning a busy-wait loop, and must be paired
+ * with spin_end when the loop is exited. spin_cpu_relax must be called
+ * within the loop.
+ *
+ * The loop body should be as small and fast as possible, on the order of
+ * tens of instructions/cycles as a guide. It should avoid calling
+ * cpu_relax, or any "spin" or sleep type of primitive, including nested
+ * uses of these primitives. It should not lock or take any other resource.
+ * Violations of this will not cause a bug, but may cause sub-optimal
+ * performance.
+ *
+ * These loops are optimized to be used where wait times are expected to be
+ * less than the cost of a context switch (and associated overhead).
+ *
+ * Detection of resource owner and decision to spin or sleep or guest-yield
+ * (e.g., spin lock holder vcpu preempted, or mutex owner not on CPU) can be
+ * tested within the busy loop body if necessary.
+ */
+#ifndef spin_begin
+#define spin_begin()
+#endif
+
+#ifndef spin_cpu_relax
+#define spin_cpu_relax() cpu_relax()
+#endif
+
+/*
+ * spin_cpu_yield may be called to yield (undirected) to the hypervisor if
+ * necessary. This should be used if the wait is expected to take longer
+ * than context switch overhead, but we can't sleep or do a directed yield.
+ */
+#ifndef spin_cpu_yield
+#define spin_cpu_yield() cpu_relax_yield()
+#endif
+
+#ifndef spin_end
+#define spin_end()
+#endif
+
+#endif /* _LINUX_PROCESSOR_H */
+
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ac6d5176463d..99a032d9f4a9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -63,9 +64,13 @@ static noinline int __cpuidle cpu_idle_poll(void)
trace_cpu_idle_rcuidle(0, smp_processor_id());
local_irq_enable();
stop_critical_timings();
+
+   spin_begin();
while (!tif_need_resched() &&
(cpu_idle_force_poll || tick_check_broadcast_expired()))
-   cpu_relax();
+   spin_cpu_relax();
+   spin_end();
+
start_critical_timings();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
rcu_idle_exit();
-- 
2.11.0
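
For illustration, a user-space sketch of the proposed begin/relax/end
pattern, with the primitives stubbed out as no-ops (on powerpc they would
map to HMT_low()/barrier()/HMT_medium() as in the patch above):

```c
#include <assert.h>

/* Portable no-op stubs; an arch would supply real definitions. */
#define spin_begin()		do { } while (0)
#define spin_cpu_relax()	do { } while (0)
#define spin_end()		do { } while (0)

/* Busy-wait on a condition using the proposed primitives. */
static int wait_demo(void)
{
	volatile int flag = 0;
	int iters = 0;

	spin_begin();
	while (!flag) {
		spin_cpu_relax();
		if (++iters == 3)
			flag = 1;	/* simulate the condition arriving */
	}
	spin_end();
	return iters;
}
```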



Re: [PATCH 3/3] cpuidle: powernv: Avoid a branch in the core snooze_loop() loop

2017-04-03 Thread Vaidyanathan Srinivasan
* Anton Blanchard  [2017-04-04 07:54:14]:

> From: Anton Blanchard 
> 
> When in the snooze_loop() we want to take up the least amount of
> resources. On my version of gcc (6.3), we end up with an extra
> branch because it predicts snooze_timeout_en to be false, whereas it
> is almost always true.

By default snooze_timeout_en is true.  It becomes false only when we
never want to exit the snooze loop, which happens when snooze is the
only idle state available on the platform - a rare case.
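
For readers unfamiliar with the annotation, likely()/unlikely() are thin
wrappers around GCC's __builtin_expect(); a minimal sketch (essentially
the definitions from the kernel's compiler.h) showing that the hint only
affects code layout, never the result:

```c
#include <assert.h>

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Same shape as the snooze_loop() exit test being discussed. */
static int should_exit(int timeout_en, unsigned long tb,
		       unsigned long exit_time)
{
	return likely(timeout_en) && tb > exit_time;
}
```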

> Use likely() to avoid the branch and be a little nicer to the
> other non idle threads on the core.
> 
> Signed-off-by: Anton Blanchard 

Reviewed-by: Vaidyanathan Srinivasan 

> ---
>  drivers/cpuidle/cpuidle-powernv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> b/drivers/cpuidle/cpuidle-powernv.c
> index 8c991c254b95..251a60bfa8ee 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -58,7 +58,7 @@ static int snooze_loop(struct cpuidle_device *dev,
>   ppc64_runlatch_off();
>   HMT_very_low();
>   while (!need_resched()) {
> - if (snooze_timeout_en && get_tb() > snooze_exit_time)
> + if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time)
>   break;
>   }
> 
> -- 
> 2.11.0
> 



Re: [PATCH 2/3] cpuidle: powernv: Don't continually set thread priority in snooze_loop()

2017-04-03 Thread Vaidyanathan Srinivasan
* Anton Blanchard  [2017-04-04 07:54:13]:

> From: Anton Blanchard 
> 
> The powerpc64 kernel exception handlers have preserved thread priorities
> for a long time now, so there is no need to continually set it.
> 
> Just set it once on entry and once exit.
> 
> Signed-off-by: Anton Blanchard 

Reviewed-by: Vaidyanathan Srinivasan 

> ---
>  drivers/cpuidle/cpuidle-powernv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> b/drivers/cpuidle/cpuidle-powernv.c
> index 9d9f164894eb..8c991c254b95 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -56,8 +56,8 @@ static int snooze_loop(struct cpuidle_device *dev,
> 
>   snooze_exit_time = get_tb() + snooze_timeout;
>   ppc64_runlatch_off();
> + HMT_very_low();
>   while (!need_resched()) {
> - HMT_very_low();
>   if (snooze_timeout_en && get_tb() > snooze_exit_time)
>   break;
>   }
> -- 
> 2.11.0
> 



Re: [PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority

2017-04-03 Thread Vaidyanathan Srinivasan
* Anton Blanchard  [2017-04-04 07:54:12]:

> From: Anton Blanchard 
> 
> The core of snooze_loop() continually bounces between low and very
> low thread priority. Changing thread priorities is an expensive
> operation that can negatively impact other threads on a core.
> 
> All CPUs that can run PowerNV support very low priority, so we can
> avoid the change completely.
> 
> Signed-off-by: Anton Blanchard 

Reviewed-by: Vaidyanathan Srinivasan 

> ---
>  drivers/cpuidle/cpuidle-powernv.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> b/drivers/cpuidle/cpuidle-powernv.c
> index cda8f62d555b..9d9f164894eb 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -57,7 +57,6 @@ static int snooze_loop(struct cpuidle_device *dev,
>   snooze_exit_time = get_tb() + snooze_timeout;
>   ppc64_runlatch_off();
>   while (!need_resched()) {
> - HMT_low();
>   HMT_very_low();
>   if (snooze_timeout_en && get_tb() > snooze_exit_time)
>   break;


HMT_low() is legacy and can be removed for powernv platforms.

--Vaidy



Re: [PATCH v6 05/11] powerpc/perf: Generic imc pmu event functions

2017-04-03 Thread Daniel Axtens
Hi,

So a major complaint I have is that you're changing prototypes of
functions from earlier patches.

This makes my life a lot harder: I get my head around what a function is
and does and then suddenly the prototype changes, the behaviour changes,
and I have to re-evaluate everything I thought I knew about the
function. The diffs are also usually quite unhelpful.

It would be far better, from my point of view, to squash related commits
together. You're adding a large-ish driver: we might as well review one
large, complete commit rather than several smaller incomplete commits.

There are a lot of interrelated benefits to this - just from the patches
I've reviewed so far, if there were fewer, larger commits then:

 - I would only have to read a function once to get a full picture of
   what it does

 - comments in function headers wouldn't get out of sync with function
   bodies

 - there would be less movement of variables from file to file

 - earlier comments wouldn't be invalidated by things I learn later. For
   example I suggested different names for imc_event_info_{str,val}
   based on their behaviour when first added in patch 3. Here you change
   the behaviour of imc_event_info_val - it would have been helpful to
   see the fuller picture when I was first reviewing as I would have
   been able to make more helpful suggestions.

and so on.

Anyway, further comments in line.

> From: Hemant Kumar 
>
> Since, the IMC counters' data are periodically fed to a memory location,
> the functions to read/update, start/stop, add/del can be generic and can
> be used by all IMC PMU units.
>
> This patch adds a set of generic imc pmu related event functions to be
> used  by each imc pmu unit. Add code to setup format attribute and to
> register imc pmus. Add a event_init function for nest_imc events.
>
> Signed-off-by: Anju T Sudhakar 
> Signed-off-by: Hemant Kumar 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/include/asm/imc-pmu.h|   1 +
>  arch/powerpc/perf/imc-pmu.c   | 137 
> ++
>  arch/powerpc/platforms/powernv/opal-imc.c |  30 ++-
>  3 files changed, 164 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/imc-pmu.h 
> b/arch/powerpc/include/asm/imc-pmu.h
> index a3d4f1bf9492..e13f51047522 100644
> --- a/arch/powerpc/include/asm/imc-pmu.h
> +++ b/arch/powerpc/include/asm/imc-pmu.h
> @@ -65,4 +65,5 @@ struct imc_pmu {
>  #define IMC_DOMAIN_NEST  1
>  #define IMC_DOMAIN_UNKNOWN   -1
>  
> +int imc_get_domain(struct device_node *pmu_dev);
>  #endif /* PPC_POWERNV_IMC_PMU_DEF_H */
> diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
> index ec464c76b749..bd5bf66bd920 100644
> --- a/arch/powerpc/perf/imc-pmu.c
> +++ b/arch/powerpc/perf/imc-pmu.c
> @@ -18,6 +18,132 @@
>  struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
>  struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
>  
> +/* Needed for sanity check */
> +extern u64 nest_max_offset;
Should this be in a header file?

> +
> +PMU_FORMAT_ATTR(event, "config:0-20");
> +static struct attribute *imc_format_attrs[] = {
> + &format_attr_event.attr,
> + NULL,
> +};
> +
> +static struct attribute_group imc_format_group = {
> + .name = "format",
> + .attrs = imc_format_attrs,
> +};
> +
> +static int nest_imc_event_init(struct perf_event *event)
> +{
> + int chip_id;
> + u32 config = event->attr.config;
> + struct perchip_nest_info *pcni;
> +
> + if (event->attr.type != event->pmu->type)
> + return -ENOENT;
> +
> + /* Sampling not supported */
> + if (event->hw.sample_period)
> + return -EINVAL;
> +
> + /* unsupported modes and filters */
> + if (event->attr.exclude_user   ||
> + event->attr.exclude_kernel ||
> + event->attr.exclude_hv ||
> + event->attr.exclude_idle   ||
> + event->attr.exclude_host   ||
> + event->attr.exclude_guest)
> + return -EINVAL;
> +
> + if (event->cpu < 0)
> + return -EINVAL;
> +
> + /* Sanity check for config (event offset) */
> + if (config > nest_max_offset)
> + return -EINVAL;
config is a u32, nest_max_offset is a u64. Is there a reason for this?
(Also, config is an unintuitive name for the local variable - the data
is stored in the attribute config space but the value represents an
offset into the reserved memory region.)

> +
> + chip_id = topology_physical_package_id(event->cpu);
> + pcni = &nest_perchip_info[chip_id];
> +
> + /*
> +  * Memory for Nest HW counter data could be in multiple pages.
> +  * Hence check and pick the right base page from the event offset,
> +  * and then add to it.
> +  */
> + event->hw.event_base = pcni->vbase[config/PAGE_SIZE] +
> + (config & ~PAGE_MASK);
> +
> + return 0;
> +}
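
The quoted event_base computation splits the config offset into a page
index and an intra-page offset. That arithmetic can be checked in
isolation with a sketch using demo constants (not the kernel's PAGE_SIZE
macros):

```c
#include <assert.h>

#define DEMO_PAGE_SIZE	4096UL
#define DEMO_PAGE_MASK	(~(DEMO_PAGE_SIZE - 1))

/* Mirror of "vbase[config/PAGE_SIZE] + (config & ~PAGE_MASK)":
 * which base page does the offset land in... */
static unsigned long page_index(unsigned long off)
{
	return off / DEMO_PAGE_SIZE;
}

/* ...and how far into that page. */
static unsigned long page_offset(unsigned long off)
{
	return off & ~DEMO_PAGE_MASK;
}
```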
> +
> +static void imc_read_counter(struct perf_event *even

Re: [RFC][PATCH] spin loop arch primitives for busy waiting

2017-04-03 Thread Nicholas Piggin
On Mon, 3 Apr 2017 17:43:05 -0700
Linus Torvalds  wrote:

> On Mon, Apr 3, 2017 at 4:50 PM, Nicholas Piggin  wrote:

> > If you have any ideas, I'd be open to them.  
> 
> So the idea would be that maybe we can just make those things
> explicit. IOW, instead of having that magical looping construct that
> does other magical hidden things as part of the loop, maybe we can
> just have a
> 
>begin_cpu_relax();
>while (!cond)
>cpu_relax();
>end_cpu_relax();
> 
> and then architectures can decide how they implement it. So for x86,
> the begin/end macros would be empty. For ppc, maybe begin/end would be
> the "lower and raise priority", while cpu_relax() itself is an empty
> thing.
> 
> Or maybe "begin" just clears a counter, while "cpu_relax()" does some
> "increase iterations, and lower priority after X iterations", and then
> "end" raises the priority again.
> 
> The "do magic having a special loop" approach disturbs me. I'd much
> rather have more explicit hooks that allow people to do their own loop
> semantics (including having a "return" to exit early).

I guess so. Users will still have to read the documentation rather than
just throwing it in ad hoc because it seems like a good idea.

For example powerpc must not use any other primitives that might change
SMT priority in the idle loop. For x86 if you do too much work, the rep
; nop component becomes relatively small and you lose benefit (10 cycles
latency pre-skylake).

I would suggest keeping standalone cpu_relax() for incidental code in
drivers and things (powerpc may add some extra nops between SMT low
and SMT normal priorities to improve it a little), and this would
be used by important core code and primitives.

> But that depends on architectures having some pattern that we *can*
> abstract. Would some "begin/in-loop/end" pattern like the above be
> sufficient?

Yes. begin/in/end would be sufficient for powerpc SMT priority, and
for x86, and it looks like sparc64 too. So we could do that if you
prefer.

After this, I might look at optimizing the loop code itself (optimizing
exit condition, de-pipelining, etc). That would require spin loop
primitives like this again. BUT they would not have funny restrictions
on exit conditions because they wouldn't do SMT priority. If I get
positive numbers for that, would you be opposed to having such
primitives (just for the important core spin loops)?

> I think s390 might have issues too, since they tried to have that
> "cpu_relax_yield" thing (which is only used by stop_machine), and
> they've tried cpu_relax_lowlatency() and other games.

There are some considerations with hv/guest yielding, yes. I'm not
touching that yet, but it needs to be looked at. Generic code has no
chance of looking at primitives available and deciding which should
be best used (not least because they aren't documented).

I think for the most part, busy loops shouldn't be done if they tend
to be more expensive than a context switch + associated cache costs,
so hypervisors are okay there. It's just rare cases where we *have*
to spin. Directed spinning or yielding for a resource is possibly more
general pattern and something to look at adding to these spin APIs, but
for now they should be okay to just remain within the loop body.

Summary: hypervisor guests should not be affected by the new
APIs, but we could look at augmenting them later with some hv hints.

Thanks,
Nick


Re: [PATCH v6 04/11] powerpc/perf: Add event attribute and group to IMC pmus

2017-04-03 Thread Daniel Axtens
Hi,

> Device tree IMC driver code parses the IMC units and their events. It
> passes the information to IMC pmu code which is placed in powerpc/perf
> as "imc-pmu.c".
>
> This patch creates only event attributes and attribute groups for the
> IMC pmus.
>
> Signed-off-by: Anju T Sudhakar 
> Signed-off-by: Hemant Kumar 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/perf/Makefile|  6 +-
>  arch/powerpc/perf/imc-pmu.c   | 97 
> +++
>  arch/powerpc/platforms/powernv/opal-imc.c | 12 +++-
>  3 files changed, 112 insertions(+), 3 deletions(-)
>  create mode 100644 arch/powerpc/perf/imc-pmu.c
>
> diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
> index 4d606b99a5cb..d0d1f04203c7 100644
> --- a/arch/powerpc/perf/Makefile
> +++ b/arch/powerpc/perf/Makefile
> @@ -2,10 +2,14 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
>  
>  obj-$(CONFIG_PERF_EVENTS)+= callchain.o perf_regs.o
>  
> +imc-$(CONFIG_PPC_POWERNV)   += imc-pmu.o
> +
>  obj-$(CONFIG_PPC_PERF_CTRS)  += core-book3s.o bhrb.o
>  obj64-$(CONFIG_PPC_PERF_CTRS)+= power4-pmu.o ppc970-pmu.o 
> power5-pmu.o \
>  power5+-pmu.o power6-pmu.o power7-pmu.o \
> -isa207-common.o power8-pmu.o power9-pmu.o
> +isa207-common.o power8-pmu.o power9-pmu.o \
> +$(imc-y)

Hmm, this seems... obtuse. Will the IMC stuff fail to compile on
non-PowerNV? Can we just link it in like we do with every other sort of
performance counter and let runtime detection do its thing?

> +
>  obj32-$(CONFIG_PPC_PERF_CTRS)+= mpc7450-pmu.o
>  
>  obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
> diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
> new file mode 100644
> index ..ec464c76b749
> --- /dev/null
> +++ b/arch/powerpc/perf/imc-pmu.c
> @@ -0,0 +1,97 @@
> +/*
> + * Nest Performance Monitor counter support.
> + *
> + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
> + *(C) 2016 Hemant K Shaw, IBM Corporation.
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
> +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
> +
> +/* dev_str_attr : Populate event "name" and string "str" in attribute */
> +static struct attribute *dev_str_attr(const char *name, const char *str)
> +{
> + struct perf_pmu_events_attr *attr;
> +
> + attr = kzalloc(sizeof(*attr), GFP_KERNEL);
> +
> + sysfs_attr_init(&attr->attr.attr);
> +
> + attr->event_str = str;
> + attr->attr.attr.name = name;
> + attr->attr.attr.mode = 0444;
> + attr->attr.show = perf_event_sysfs_show;
> +
> + return &attr->attr.attr;
> +}
> +
> +/*
> + * update_events_in_group: Update the "events" information in an attr_group
> + * and assign the attr_group to the pmu "pmu".
> + */
> +static int update_events_in_group(struct imc_events *events,
> +   int idx, struct imc_pmu *pmu)

So I've probably missed a key point in the earlier patches, but for the
life of me I cannot figure out what idx is supposed to represent.

> +{
> + struct attribute_group *attr_group;
> + struct attribute **attrs;
> + int i;
> +
> + /* Allocate memory for attribute group */
> + attr_group = kzalloc(sizeof(*attr_group), GFP_KERNEL);
> + if (!attr_group)
> + return -ENOMEM;
> +
> + /* Allocate memory for attributes */
> + attrs = kzalloc((sizeof(struct attribute *) * (idx + 1)), GFP_KERNEL);

Also, idx is an int, so:
 - things will go wrong if idx is -1
 - things will go very wrong if idx is -2
 - do you need to do some sort of validation to make sure it's within
   bounds? If not in this function, what function ensures the necessary
   'sanity check' preconditions for idx?
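For illustration, the kind of precondition this is asking about might look like the following sketch. IMC_MAX_EVENTS is a made-up bound; the real limit would have to come from whatever the device tree parsing guarantees:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: reject non-positive or oversized idx before
 * sizing the attribute array. IMC_MAX_EVENTS is invented for the
 * example and is not a real kernel constant.
 */
#define IMC_MAX_EVENTS	512

static long attr_array_bytes(int idx)
{
	if (idx <= 0 || idx > IMC_MAX_EVENTS)
		return -1;	/* caller would treat this as -EINVAL */
	/* idx entries plus the NULL terminator sysfs expects */
	return (long)(sizeof(void *) * ((size_t)idx + 1));
}
```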

> + if (!attrs) {
> + kfree(attr_group);
> + return -ENOMEM;
> + }
> +
> + attr_group->name = "events";
> + attr_group->attrs = attrs;
> + for (i = 0; i < idx; i++, events++) {
> + attrs[i] = dev_str_attr((char *)events->ev_name,
> + (char *)events->ev_value);
> + }
> +
> + pmu->attr_groups[0] = attr_group;

Again, I may have missed something, but what's special about group 0?
Patch 1 lets you have up to 4 groups - are they used elsewhere?

> + return 0;
> +}
> +
> +/*
> + * init_imc_pmu : Setup the IMC pmu device in "pmu_ptr" and its events
> + *"events".
> + * Setup the cpu mask information for these pmus and setup the state machine
> + * hotplug notifiers as well.


Re: [PATCH v6 01/11] powerpc/powernv: Data structure and macros definitions

2017-04-03 Thread Daniel Axtens
Hi,

> +#define IMC_MAX_CHIPS32
> +#define IMC_MAX_PMUS 32
> +#define IMC_MAX_PMU_NAME_LEN 256
I've noticed this is used as both the maximum length for event names and
event value strings. Would another name suit better?

> +
> +#define IMC_NEST_MAX_PAGES   16
> +
> +#define IMC_DTB_COMPAT   "ibm,opal-in-memory-counters"
> +#define IMC_DTB_NEST_COMPAT  "ibm,imc-counters-nest"
> +
> +/*
> + * Structure to hold per chip specific memory address
> + * information for nest pmus. Nest Counter data are exported
> + * in per-chip reserved memory region by the PORE Engine.
> + */
> +struct perchip_nest_info {
> + u32 chip_id;
> + u64 pbase;
> + u64 vbase[IMC_NEST_MAX_PAGES];
> + u64 size;
> +};
> +
> +/*
> + * Place holder for nest pmu events and values.
> + */
> +struct imc_events {
> + char *ev_name;
> + char *ev_value;
> +};
> +
> +/*
> + * Device tree parser code detects IMC pmu support and
> + * registers new IMC pmus. This structure will
> + * hold the pmu functions and attrs for each imc pmu and
> + * will be referenced at the time of pmu registration.
> + */
> +struct imc_pmu {
> + struct pmu pmu;
> + int domain;
> + const struct attribute_group *attr_groups[4];
> +};
> +
> +/*
> + * Domains for IMC PMUs
> + */
> +#define IMC_DOMAIN_NEST  1
> +#define IMC_DOMAIN_UNKNOWN   -1
> +
> +#endif /* PPC_POWERNV_IMC_PMU_DEF_H */
> -- 
> 2.7.4


Re: [PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module

2017-04-03 Thread Daniel Axtens
Hi,

> + do {
> + pages = PAGE_SIZE * i;
> + pcni->vbase[i++] = (u64)phys_to_virt(pcni->pbase +
> +  pages);
> + } while (i < (pcni->size / PAGE_SIZE));

I also just noticed that there's no check here against
IMC_NEST_MAX_PAGES: should that be inserted? (If for no other reason
than to stop every static analysis tool complaining!)
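A minimal sketch of the suggested check, with the patch's structures reduced to stand-ins and phys_to_virt() faked as an identity mapping so the loop logic is testable in isolation (returning an error instead of clamping would also be a reasonable choice):

```c
#include <assert.h>

/*
 * Sketch of the missing bound: cap the number of pages written into
 * vbase[] at IMC_NEST_MAX_PAGES. The struct is a reduced stand-in for
 * the one in the patch; fake_phys_to_virt() is a test double.
 */
#define IMC_NEST_MAX_PAGES	16
#define PAGE_SZ			4096UL

struct nest_info_sketch {
	unsigned long pbase;
	unsigned long vbase[IMC_NEST_MAX_PAGES];
	unsigned long size;
};

static unsigned long fake_phys_to_virt(unsigned long p)
{
	return p;
}

/* Returns the number of pages actually mapped. */
static int map_nest_pages(struct nest_info_sketch *pcni)
{
	unsigned long npages = pcni->size / PAGE_SZ;
	unsigned long i;

	if (npages > IMC_NEST_MAX_PAGES)	/* the suggested check */
		npages = IMC_NEST_MAX_PAGES;

	for (i = 0; i < npages; i++)
		pcni->vbase[i] = fake_phys_to_virt(pcni->pbase + PAGE_SZ * i);
	return (int)npages;
}

/* An oversized region must not overrun vbase[]. */
static int demo_clamp(void)
{
	struct nest_info_sketch p = { .pbase = 0x1000, .size = 32 * PAGE_SZ };

	return map_nest_pages(&p);
}
```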

Daniel

> + }
> +
> + return 0;
> +err:
> + return -ENODEV;
> +}
> +
> +static const struct of_device_id opal_imc_match[] = {
> + { .compatible = IMC_DTB_COMPAT },
> + {},
> +};
> +
> +static struct platform_driver opal_imc_driver = {
> + .driver = {
> + .name = "opal-imc-counters",
> + .of_match_table = opal_imc_match,
> + },
> + .probe = opal_imc_counters_probe,
> +};
> +
> +MODULE_DEVICE_TABLE(of, opal_imc_match);
> +module_platform_driver(opal_imc_driver);
> +MODULE_DESCRIPTION("PowerNV OPAL IMC driver");
> +MODULE_LICENSE("GPL");
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powerpc/platforms/powernv/opal.c
> index e0f856bfbfe8..85ea1296f030 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -30,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "powernv.h"
>  
> @@ -631,6 +633,15 @@ static void opal_pdev_init(const char *compatible)
>   of_platform_device_create(np, NULL, NULL);
>  }
>  
> +static void opal_imc_init_dev(void)
> +{
> + struct device_node *np;
> +
> + np = of_find_compatible_node(NULL, NULL, IMC_DTB_COMPAT);
> + if (np)
> + of_platform_device_create(np, NULL, NULL);
> +}
> +
>  static int kopald(void *unused)
>  {
>   unsigned long timeout = msecs_to_jiffies(opal_heartbeat) + 1;
> @@ -704,6 +715,9 @@ static int __init opal_init(void)
>   /* Setup a heatbeat thread if requested by OPAL */
>   opal_init_heartbeat();
>  
> + /* Detect IMC pmu counters support and create PMUs */
> + opal_imc_init_dev();
> +
>   /* Create leds platform devices */
>   leds = of_find_node_by_path("/ibm,opal/leds");
>   if (leds) {
> -- 
> 2.7.4


Re: [PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and its events

2017-04-03 Thread Daniel Axtens
Madhavan Srinivasan  writes:

> From: Hemant Kumar 
>
> Parse device tree to detect IMC units. Traverse through each IMC unit
> node to find supported events and corresponding unit/scale files (if any).
>
> Here is the DTS file for reference:
>
>   
> https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts
>
> The device tree for IMC counters starts at the node "imc-counters".
> This node contains all the IMC PMU nodes and event nodes
> for these IMC PMUs. The PMU nodes have an "events" property which has a
> phandle value for the actual events node. The events are separated from
> the PMU nodes to abstract out the common events. For example, PMU node
> "mcs0", "mcs1" etc. will contain a pointer to "nest-mcs-events" since,
> the events are common between these PMUs. These events have a different
> prefix based on their relation to different PMUs, and hence, the PMU
> nodes themselves contain an "events-prefix" property. The value for this
> property concatenated to the event name, forms the actual event
> name. Also, the PMU have a "reg" field as the base offset for the events
> which belong to this PMU. This "reg" field is added to an event in the
> "events" node, which gives us the location of the counter data. Kernel
> code uses this offset as event configuration value.

This is probably because I haven't looked at this area recently, but
what do you mean by 'event configuration value'? If I understand
correctly, you're saying something like "kernel code stores this
calculated location in the event configuration data" - is that about
right?

Apologies if this is a dumb question.

>
> Device tree parser code also looks for scale/unit property in the event
> node and passes on the value as an event attr for perf interface to use
> in the post processing by the perf tool. Some PMUs may have common scale
> and unit properties which implies that all events supported by this PMU
> inherit the scale and unit properties of the PMU itself. For those
> events, we need to set the common unit and scale values.
>
> For failure to initialize any unit or any event, disable that unit and
> continue setting up the rest of them.
>
> Signed-off-by: Hemant Kumar 
> Signed-off-by: Anju T Sudhakar 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/platforms/powernv/opal-imc.c | 386 
> ++
>  1 file changed, 386 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
> b/arch/powerpc/platforms/powernv/opal-imc.c
> index c476d596c6a8..ba0203e669c0 100644
> --- a/arch/powerpc/platforms/powernv/opal-imc.c
> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
> @@ -33,6 +33,388 @@
>  #include 
>  
>  struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
> +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
> +
> +static int imc_event_info(char *name, struct imc_events *events)
This seems to be an allocation/init function: should that be reflected
in the name?

> +{
> + char *buf;
> +
> + /* memory for content */
> + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + events->ev_name = name;
We trust the name will never become a dangling pointer? We don't need to
duplicate it?

Should the pointer be const?

> + events->ev_value = buf;
> + return 0;
> +}
> +
> +static int imc_event_info_str(struct property *pp, char *name,
> +struct imc_events *events)
This creates and sets an initial value based on a property, ...

> +static int imc_event_info_val(char *name, u32 val,
> +   struct imc_events *events)
... this creates and sets based on a u32.

Could you call them something that suggests their purpose a bit better?
Maybe event_info_from_{property,val}? 

> +{
> + int ret;
> +
> + ret = imc_event_info(name, events);
> + if (ret)
> + return ret;
> + sprintf(events->ev_value, "event=0x%x", val);
> +
> + return 0;
> +}
> +
> +static int set_event_property(struct property *pp, char *event_prop,
> +   struct imc_events *events, char *ev_name)
> +{
> + char *buf;
> + int ret;
> +
> + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + sprintf(buf, "%s.%s", ev_name, event_prop);
> + ret = imc_event_info_str(pp, buf, events);
> + if (ret) {
> + kfree(events->ev_name);
> + kfree(events->ev_value);
How do you know the buffer for ev_value was successfully allocated? Can
you be sure it is safe to free? (Consider imc_event_info returning -ENOMEM)
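To make the hazard concrete, here is a reduced model of that error path, with one possible fix: initialize both pointers up front so cleanup of a partially-initialized struct is always safe (kfree(NULL), like free(NULL), is a no-op). The struct and helpers are simplified stand-ins, not the patch's actual code:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Simplified model of the set_event_property() error path. The fix
 * sketched here is to NULL both pointers before any allocation can
 * fail, so the unconditional frees on the error path stay safe.
 */
struct imc_events_sketch {
	char *ev_name;
	char *ev_value;
};

/* Analogue of imc_event_info(): may fail before ev_value is set. */
static int event_info(struct imc_events_sketch *ev, int simulate_enomem)
{
	ev->ev_name = NULL;
	ev->ev_value = NULL;		/* now always safe to free */
	if (simulate_enomem)
		return -1;
	ev->ev_value = calloc(1, 256);
	return ev->ev_value ? 0 : -1;
}

static int set_property(struct imc_events_sketch *ev, int simulate_enomem)
{
	int ret = event_info(ev, simulate_enomem);

	if (ret) {
		/* both frees are safe even on the early-failure path */
		free(ev->ev_name);
		free(ev->ev_value);
	}
	return ret;
}

/* Exercise the early -ENOMEM-style failure without crashing. */
static int demo_early_failure(void)
{
	struct imc_events_sketch ev;

	return set_property(&ev, 1);
}
```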

> + }
> +
> + return ret;
> +}
> +
> +/*
> + * imc_events_node_parser: Parse the event node "dev" and assign the parsed
> + * information to event "events".
> + *
> + * Parses the "reg" property of this event. "reg" gives us the event offset.
> + * Also, parse the "scale" and "unit" properties, if any.
> + */

Re: [PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module

2017-04-03 Thread Daniel Axtens
Hi all,

I'm trying to get my head around these patches - at this point I'm just
doing a first pass, so I may have more substantive structural comments
later on. In the mean time - here are some minor C nits:

> + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
> + *   (C) 2016 Hemant K Shaw, IBM Corporation.

Should these be bumped to 2017?

> +
> + do {
> + pages = PAGE_SIZE * i;
> + pcni->vbase[i++] = (u64)phys_to_virt(pcni->pbase +
> +  pages);
> + } while (i < (pcni->size / PAGE_SIZE));
> + }
I had to scroll back up to the top of this function to make sure I
understood what this loop does. Would it be better to write it as:

for (i = 0; i < (pcni->size / PAGE_SIZE); i++) {
pages = PAGE_SIZE * i;
pcni->vbase[i] = (u64)phys_to_virt(pcni->pbase + pages);
}

And, just checking - this is expected to work with both 4kB and 64kB pages?

> +
> + return 0;
> +err:
> + return -ENODEV;

You're not releasing any resources here - would it be better to just
replace the gotos with this return? I haven't checked to see if you
change the function later on to allocate memory - if so please ignore :)

> +}
> +
> +static const struct of_device_id opal_imc_match[] = {
> + { .compatible = IMC_DTB_COMPAT },
> + {},
> +};
> +
> +static struct platform_driver opal_imc_driver = {
> + .driver = {
> + .name = "opal-imc-counters",
> + .of_match_table = opal_imc_match,
> + },
> + .probe = opal_imc_counters_probe,
> +};
> +
> +MODULE_DEVICE_TABLE(of, opal_imc_match);
> +module_platform_driver(opal_imc_driver);
> +MODULE_DESCRIPTION("PowerNV OPAL IMC driver");
> +MODULE_LICENSE("GPL");
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powerpc/platforms/powernv/opal.c
> index e0f856bfbfe8..85ea1296f030 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -30,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "powernv.h"
>  
> @@ -631,6 +633,15 @@ static void opal_pdev_init(const char *compatible)
>   of_platform_device_create(np, NULL, NULL);
>  }
>  
> +static void opal_imc_init_dev(void)
> +{
> + struct device_node *np;
> +
> + np = of_find_compatible_node(NULL, NULL, IMC_DTB_COMPAT);
> + if (np)
> + of_platform_device_create(np, NULL, NULL);
> +}

Should this function be tagged __init?

> +
>  static int kopald(void *unused)
>  {
>   unsigned long timeout = msecs_to_jiffies(opal_heartbeat) + 1;
> @@ -704,6 +715,9 @@ static int __init opal_init(void)
>   /* Setup a heatbeat thread if requested by OPAL */
>   opal_init_heartbeat();
>  
> + /* Detect IMC pmu counters support and create PMUs */
> + opal_imc_init_dev();
> +
>   /* Create leds platform devices */
>   leds = of_find_node_by_path("/ibm,opal/leds");
>   if (leds) {
> -- 
> 2.7.4


Re: [RFC][PATCH] spin loop arch primitives for busy waiting

2017-04-03 Thread Linus Torvalds
On Mon, Apr 3, 2017 at 4:50 PM, Nicholas Piggin  wrote:
>
> POWER does not have an instruction like pause. We can only set current
> thread priority, and current implementations do something like allocate
> issue cycles to threads based on relative priorities. So there should
> be at least one or two issue cycles at low priority, but ideally we
> would not be changing priority in the busy-wait loop because it can
> impact other threads in the core.
>
> I couldn't think of a good way to improve cpu_relax. Our (open source)
> firmware has a cpu_relax, and it puts a bunch of nops between low and
> normal priority instructions so we get some fetch cycles at low prio.
> That isn't ideal though.
>
> If you have any ideas, I'd be open to them.

So the idea would be that maybe we can just make those things
explicit. IOW, instead of having that magical looping construct that
does other magical hidden things as part of the loop, maybe we can
just have a

   begin_cpu_relax();
   while (!cond)
   cpu_relax();
   end_cpu_relax();

and then architectures can decide how they implement it. So for x86,
the begin/end macros would be empty. For ppc, maybe begin/end would be
the "lower and raise priority", while cpu_relax() itself is an empty
thing.

Or maybe "begin" just clears a counter, while "cpu_relax()" does some
"increase iterations, and lower priority after X iterations", and then
"end" raises the priority again.
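As a sketch of that counter idea (hypothetical names throughout; the priority calls are modelled as plain flags rather than real HMT or pause instructions, so the control flow can be checked anywhere):

```c
#include <assert.h>

/*
 * Hypothetical sketch of the counter variant: "begin" clears a
 * counter, the relax hook only drops priority after some number of
 * iterations, and "end" restores it. None of these names are real
 * kernel APIs.
 */
#define SPIN_PRIO_THRESHOLD 16

static unsigned int spin_iters;
static int prio_is_low;

static void begin_cpu_relax(void)
{
	spin_iters = 0;
}

static void cpu_relax_counted(void)
{
	if (++spin_iters == SPIN_PRIO_THRESHOLD)
		prio_is_low = 1;	/* e.g. HMT_low() on powerpc */
	/* otherwise: plain barrier / "rep; nop" */
}

static void end_cpu_relax(void)
{
	if (prio_is_low)
		prio_is_low = 0;	/* e.g. HMT_medium() */
}

/* Spin until *cond is nonzero or max iterations elapse. */
static int wait_on(volatile int *cond, unsigned int max)
{
	begin_cpu_relax();
	while (!*cond && spin_iters < max)
		cpu_relax_counted();
	end_cpu_relax();
	return *cond;
}

/* How many iterations a doomed wait performs before giving up. */
static unsigned int spins_before_giveup(unsigned int max)
{
	int cond = 0;

	wait_on(&cond, max);
	return spin_iters;
}
```

A short wait never pays the priority-change cost at all, while a long wait settles into low priority after the threshold and restores it exactly once on exit.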

The "do magic having a special loop" approach disturbs me. I'd much
rather have more explicit hooks that allow people to do their own loop
semantics (including having a "return" to exit early).

But that depends on architectures having some pattern that we *can*
abstract. Would some "begin/in-loop/end" pattern like the above be
sufficient? The pure "in-loop" case we have now (i.e. "cpu_relax()")
clearly isn't sufficient.

I think s390 might have issues too, since they tried to have that
"cpu_relax_yield" thing (which is only used by stop_machine), and
they've tried cpu_relax_lowlatency() and other games.

Linus


Re: [RFC PATCH 4/5] dt-bindings: soc/fsl: Update reserved memory binding for QBMan

2017-04-03 Thread Scott Wood
On Mon, 2017-04-03 at 19:49 +, Roy Pledge wrote:
> On 4/3/2017 11:42 AM, Rob Herring wrote:
> > 
> > On Wed, Mar 29, 2017 at 05:13:56PM -0400, Roy Pledge wrote:
> > > 
> > > Updates the QMan and BMan device tree bindings for reserved memory
> > > nodes. This makes the reserved memory allocation compatiable with
> > s/compatiable/compatible/
> > 
> > > 
> > > the shared-dma-pool usage.
> > This change is not backwards compatible. Please state that and explain 
> > why that is okay. If PPC needs to not change, then the old strings and 
> > properties should remain, but deprecated.
> I think I can make the old device trees compatible since the
> "compatible" string changed without too much effort or ifdefery in the
> code.  However I would like to eventually see all PPC users move to the
> new mode as I do believe it is more in alignment with the spirit of the
> reserved-memory framework that is in the kernel.

How much benefit is there to changing PPC if you have to retain the old method
anyway for compatibility?  Whereas if you don't convert, you retain test
coverage for the old device trees, and don't have to worry about
the memblock_end_of_DRAM() questions.

>   Do I need to mention the old mode in the binding.txt file? 

Yes.

>  I'm trying to keep things clean
> so someone reading the binding doesn't get a headache and trying to
> preserve my own sanity as we move forward with adding new features to
> this driver.

If they're sufficiently different then just describe the new and old nodes
separately rather than putting a bunch of if/else clauses in there.

-Scott



Re: [RFC PATCH 2/5] soc/fsl/qbman: Use shared-dma-pool for QMan private memory allocations

2017-04-03 Thread Scott Wood
On Mon, 2017-04-03 at 15:52 +0100, Robin Murphy wrote:
> On 01/04/17 08:25, Scott Wood wrote:
> > 
> > On Fri, 2017-03-31 at 18:55 +0100, Robin Murphy wrote:
> > > 
> > > On 31/03/17 04:27, Michael Ellerman wrote:
> > > > 
> > > > 
> > > > Robin Murphy  writes:
> > > > 
> > > > > 
> > > > > 
> > > > > Hi Roy,
> > > > > 
> > > > > On 29/03/17 22:13, Roy Pledge wrote:
> > > > > > 
> > > > > > 
> > > > > > Use the shared-memory-pool mechanism for frame queue descriptor
> > > > > > and
> > > > > > packed frame descriptor record area allocations.
> > > > > Thanks for persevering with this - in my opinion it's now looking
> > > > > like
> > > > > it was worth the effort :)
> > > > > 
> > > > > AFAICS the ioremap_wc() that this leads to does appear to give back
> > > > > something non-cacheable on PPC (assuming "pgprot_noncached_wc" isn't
> > > > > horrendously misnamed), and "no-map" should rule out any cacheable
> > > > > linear map alias existing, so it would seem that this approach
> > > > > should
> > > > > avert Scott's concerns about attribute mismatches.
> > > > How does 'no-map' translate into something being excluded from the
> > > > linear mapping?
> > > Reserved regions marked with "no-map" get memblock_remove()d by
> > > early_init_dt_alloc_reserved_memory_arch(). As I understand things, the
> > > linear map should only cover memblock areas, and it would be explicitly
> > > violating the semantics of "no-map" to still cover such a region.
> > Discontiguous memory isn't supported on these PPC chips.  Everything up to
> > memblock_end_of_DRAM() gets mapped -- and if that were to change, the
> > fragmentation would waste TLB1 entries.
> Ah, so the "PPC-specific angles I'm not aware of" category is indeed
> non-empty - I guess the lack of HAVE_GENERIC_DMA_COHERENT might be
> related, then.
> 
> That said, though, AFAICS only certain x86 and s390 configurations ever
> call memblock_set_bottom_up(true), so we should be able to assume that
> the reserved region allocations always fall through to
> __memblock_find_range_top_down(). Thus if your DRAM is contiguous, then
> "no-map"-ing the reserved regions will simply end up pushing
> memblock_end_of_DRAM() down in a manner that would appear to still avoid
> overlaps.

Can you guarantee it will be at the end?  What if there were other early
memblock allocations (e.g. other reserved-memory regions without no-map) that
came first?

>  I can only see that going wrong if the end of DRAM wasn't at
> least 32MB aligned to begin with - is that ever likely to happen in
> practice?

Probably not, unless the user asks for an unusual memory size via mem= or
similar.

-Scott



Re: [PATCH 3/3] cpuidle: powernv: Avoid a branch in the core snooze_loop() loop

2017-04-03 Thread Nicholas Piggin
On Tue,  4 Apr 2017 07:54:14 +1000
Anton Blanchard  wrote:

> From: Anton Blanchard 
> 
> When in the snooze_loop() we want to take up the least amount of
> resources. On my version of gcc (6.3), we end up with an extra
> branch because it predicts snooze_timeout_en to be false, whereas it
> is almost always true.
> 
> Use likely() to avoid the branch and be a little nicer to the
> other non idle threads on the core.

Patches 2 and 3 look fine. Should they be replicated to cpuidle-pseries.c
as well?

Thanks,
Nick

> 
> Signed-off-by: Anton Blanchard 
> ---
>  drivers/cpuidle/cpuidle-powernv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> b/drivers/cpuidle/cpuidle-powernv.c
> index 8c991c254b95..251a60bfa8ee 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -58,7 +58,7 @@ static int snooze_loop(struct cpuidle_device *dev,
>   ppc64_runlatch_off();
>   HMT_very_low();
>   while (!need_resched()) {
> - if (snooze_timeout_en && get_tb() > snooze_exit_time)
> + if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time)
>   break;
>   }
>  



Re: [PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority

2017-04-03 Thread Nicholas Piggin
On Tue,  4 Apr 2017 07:54:12 +1000
Anton Blanchard  wrote:

> From: Anton Blanchard 
> 
> The core of snooze_loop() continually bounces between low and very
> low thread priority. Changing thread priorities is an expensive
> operation that can negatively impact other threads on a core.
> 
> All CPUs that can run PowerNV support very low priority, so we can
> avoid the change completely.

This looks good. I have HMT_lowest(), which does alt feature patching;
we can use it for pseries and the default idle code.


Re: [RFC][PATCH] spin loop arch primitives for busy waiting

2017-04-03 Thread Nicholas Piggin
On Mon, 3 Apr 2017 08:31:30 -0700
Linus Torvalds  wrote:

> On Mon, Apr 3, 2017 at 1:13 AM, Nicholas Piggin  wrote:
> >
> > The loops have some restrictions on what can be used, but they are
> > intended to be small and simple so it's not generally a problem:
> >  - Don't use cpu_relax.
> >  - Don't use return or goto.
> >  - Don't use sleeping or spinning primitives.  
> 
> So you're supposed to "break" out of the loop if you want to exit
> early? Or what?

Yes, break (I put that as a comment somewhere).

> One of the issues is that with a do-while/until loop, at least the way
> you've coded it, it's always done at least once.
> 
> Which means that people will have to code the condition as
> 
> if (cond) {
> .. fast case..
> return;
> }
> 
> spin_do {
>...
> } spin_until (cond);
> .. slow case ..
> 
> because "cpu_relax()" itself can be slightly slow.
> 
> And the way you've done it, even if there's a "break" in the loop, the
> cpu_relax() is still done (because it's done at the top).
> 
> So quite frankly, I think "while(cond) ()" semantics would be better
> than "do { } while (cond)".
> 
> Now, a lot of loops *are* of the kind where we've already handled the
> fast case earlier, so by the time we get into the loop we really are
> waiting for the condition to become true (but we knew it started out
> false). But not all.

Yeah that's a good point, I see what you mean. The important cases I've
been looking at (e.g., mutex spinning, bit spinlock, idle loops) are the
slowpath cases. But this would need to be explicit in the API so fastpaths
don't use it.

> Doing a quick
> 
> git grep -2 cpu_relax
> 
> for existing users of cpu_relax() does imply that most of the current
> users are very much of the "cpu_relax() at the _end_ of the loop
> tests" type.
> 
> So I don't know. I think the interface sucks.

It's a bit clunky.

> What is it that POWER _actually_ wants? Not the loop - the
> "cpu_relax()" kind of thing. Can we abstract *that* better?

POWER does not have an instruction like pause. We can only set current
thread priority, and current implementations do something like allocate
issue cycles to threads based on relative priorities. So there should
be at least one or two issue cycles at low priority, but ideally we
would not be changing priority in the busy-wait loop because it can
impact other threads in the core.

I couldn't think of a good way to improve cpu_relax. Our (open source)
firmware has a cpu_relax, and it puts a bunch of nops between low and
normal priority instructions so we get some fetch cycles at low prio.
That isn't ideal though.

If you have any ideas, I'd be open to them.

I had a secondary motive for defining it as a loop, and that was using
branch hints or even asm goto to arrange the loop and exit branch
nicely. E.g., if we statically predict loop exits, then we don't take a
branch miss as the first thing when the condition changes. This is a
second-order concern though.

Thanks,
Nick


Re: [PATCH] powerpc/misc: fix exported functions that reference the TOC

2017-04-03 Thread Benjamin Herrenschmidt
On Mon, 2017-04-03 at 23:29 +1000, Michael Ellerman wrote:
> The other option would be just to make a rule that anything EXPORT'ed
> must use _GLOBAL_TOC().

Can we enforce that somewhat at build time ?

Cheers,
Ben.



Re: [PATCH] powerpc: Avoid taking a data miss on every userspace instruction miss

2017-04-03 Thread Anton Blanchard
Hi Christophe,

> > -   if (user_mode(regs))
> > +   if (!is_exec && user_mode(regs))  
> 
> Shouldn't it also check 'is_write' ?
> If it is a store, is_write should be set, shouldn't it ?

Thanks, Ben had the same suggestion. I'll add that further optimisation
in a subsequent patch.

Anton


[PATCH 3/3] cpuidle: powernv: Avoid a branch in the core snooze_loop() loop

2017-04-03 Thread Anton Blanchard
From: Anton Blanchard 

When in the snooze_loop() we want to take up the least amount of
resources. On my version of gcc (6.3), we end up with an extra
branch because it predicts snooze_timeout_en to be false, whereas it
is almost always true.

Use likely() to avoid the branch and be a little nicer to the
other non idle threads on the core.

Signed-off-by: Anton Blanchard 
---
 drivers/cpuidle/cpuidle-powernv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index 8c991c254b95..251a60bfa8ee 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -58,7 +58,7 @@ static int snooze_loop(struct cpuidle_device *dev,
ppc64_runlatch_off();
HMT_very_low();
while (!need_resched()) {
-   if (snooze_timeout_en && get_tb() > snooze_exit_time)
+   if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time)
break;
}
 
-- 
2.11.0



[PATCH 2/3] cpuidle: powernv: Don't continually set thread priority in snooze_loop()

2017-04-03 Thread Anton Blanchard
From: Anton Blanchard 

The powerpc64 kernel exception handlers have preserved thread priorities
for a long time now, so there is no need to continually set it.

Just set it once on entry and once exit.

Signed-off-by: Anton Blanchard 
---
 drivers/cpuidle/cpuidle-powernv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index 9d9f164894eb..8c991c254b95 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -56,8 +56,8 @@ static int snooze_loop(struct cpuidle_device *dev,
 
snooze_exit_time = get_tb() + snooze_timeout;
ppc64_runlatch_off();
+   HMT_very_low();
while (!need_resched()) {
-   HMT_very_low();
if (snooze_timeout_en && get_tb() > snooze_exit_time)
break;
}
-- 
2.11.0



[PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority

2017-04-03 Thread Anton Blanchard
From: Anton Blanchard 

The core of snooze_loop() continually bounces between low and very
low thread priority. Changing thread priorities is an expensive
operation that can negatively impact other threads on a core.

All CPUs that can run PowerNV support very low priority, so we can
avoid the change completely.

Signed-off-by: Anton Blanchard 
---
 drivers/cpuidle/cpuidle-powernv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index cda8f62d555b..9d9f164894eb 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -57,7 +57,6 @@ static int snooze_loop(struct cpuidle_device *dev,
snooze_exit_time = get_tb() + snooze_timeout;
ppc64_runlatch_off();
while (!need_resched()) {
-   HMT_low();
HMT_very_low();
if (snooze_timeout_en && get_tb() > snooze_exit_time)
break;
-- 
2.11.0



Re: [PATCH] raid6/altivec: adding vpermxor implementation for raid6 Q syndrome

2017-04-03 Thread Daniel Axtens
> In that function, the flow is:
>  pagefault_disable();
>  enable_kernel_altivec();
>  
>  pagefault_enable();
>
> There are a few things that it would be nice (but by no means essential)
> to find out:
>  - what is the difference between pagefault and prempt enable/disable
>  - is it required to disable altivec after the end of the function or
>can we leave that enabled?

Answering my own question here, dc4fbba11e46 ("powerpc: Create
disable_kernel_{fp,altivec,vsx,spe}()") adds the disable_ function, and
it's a no-op except under debug conditions. So it should stay.

Regards,
Daniel


>
>> +
>> +int raid6_have_altivec_vpermxor(void);
>> +#if $# == 1
>> +int raid6_have_altivec_vpermxor(void)
>> +{
>> +/* Check if CPU has both altivec and the vpermxor instruction*/
> Please add a space: s|ion*/|ion */|
>> +# ifdef __KERNEL__
>> +return (cpu_has_feature(CONFIG_ALTIVEC) &&
>> +cpu_has_feature(CPU_FTR_ARCH_207S));
> I assume this is future-proof - an ISA 3.00 cpu will advertise 2.07S
> compat?
>
>> +# else
>> +return 1;
>> +#endif
>> +
>> +}
>> +#endif
>> +
>> +const struct raid6_calls raid6_vpermxor$# = {
>> +raid6_vpermxor$#_gen_syndrome,
>> +NULL,
>> +raid6_have_altivec_vpermxor,
>> +"vpermxor$#",
>> +0
>> +};
>> +#endif
>> -- 
>> 2.9.3
>
> Apart from that this patch looks good and I expect I will be able to
> formally Review v2.
>
> Regards,
> Daniel


Re: [RFC PATCH 4/5] dt-bindings: soc/fsl: Update reserved memory binding for QBMan

2017-04-03 Thread Roy Pledge
On 4/3/2017 11:42 AM, Rob Herring wrote:
> On Wed, Mar 29, 2017 at 05:13:56PM -0400, Roy Pledge wrote:
>> Updates the QMan and BMan device tree bindings for reserved memory
>> nodes. This makes the reserved memory allocation compatiable with
> s/compatiable/compatible/
>
>> the shared-dma-pool usage.
> This change is not backwards compatible. Please state that and explain 
> why that is okay. If PPC needs to not change, then the old strings and 
> properties should remain, but deprecated.
I think I can keep the old device trees working without too much effort or
ifdeffery in the code, since only the "compatible" string changed. However,
I would like to eventually see all PPC users move to the new mode, as I
believe it is more in line with the spirit of the reserved-memory framework
in the kernel. Do I need to mention the old mode in the binding.txt file?
I'm trying to keep things clean so that someone reading the binding doesn't
get a headache, and to preserve my own sanity as we move forward with adding
new features to this driver.

Roy


Re: [PATCH] powerpc: Avoid taking a data miss on every userspace instruction miss

2017-04-03 Thread LEROY Christophe

Anton Blanchard  a écrit :


From: Anton Blanchard 

Early on in do_page_fault() we call store_updates_sp(), regardless of
the type of exception. For an instruction miss this doesn't make
sense, because we only use this information to detect if a data miss
is the result of a stack expansion instruction or not.

Worse still, it results in a data miss within every userspace
instruction miss handler, because we try and load the very instruction
we are about to install a pte for!

A simple exec microbenchmark runs 6% faster on POWER8 with this fix:

 #include <stdlib.h>
 #include <stdio.h>
 #include <unistd.h>

int main(int argc, char *argv[])
{
unsigned long left = atol(argv[1]);
char leftstr[16];

if (left-- == 0)
return 0;

sprintf(leftstr, "%ld", left);
execlp(argv[0], argv[0], leftstr, NULL);
perror("exec failed\n");

return 0;
}

Pass the number of iterations on the command line (eg 1) and time
how long it takes to execute.

Signed-off-by: Anton Blanchard 
---
 arch/powerpc/mm/fault.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index fd6484fc2fa9..3a7d580fdc59 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -287,7 +287,7 @@ int do_page_fault(struct pt_regs *regs, unsigned  
long address,

 * can result in fault, which will cause a deadlock when called with
 * mmap_sem held
 */
-   if (user_mode(regs))
+   if (!is_exec && user_mode(regs))


Shouldn't it also check 'is_write' ?
If it is a store, is_write should be set, shouldn't it ?

Christophe


store_update_sp = store_updates_sp(regs);

if (user_mode(regs))
--
2.11.0





Re: [v10, 7/7] mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

2017-04-03 Thread Kiran Kumar
Sorry to post in this huge email thread.

I have most probably hit an erratum in the Freescale T4240 for
PPC_DISABLE_THREADS. I'm using a Rev2 T4240. Does this erratum need to be
taken care of or not? Any quick help is appreciated!

My issue: I'm running line-rate traffic to the T4240 [10G of traffic on each
port of 10G capacity]. After a few hours of running traffic, my CPU/LMP
gets stuck and goes in for a hard reset. When I searched through the code and
some open forums, I saw the erratum listed above. I'm not sure whether that
erratum applies to my case. I have pasted a snapshot of the issue occurring in
my system.



During the hang, with soft lockup detection enabled, I get prints from
smp_many showing 'every processor is waiting for processor 22'; I need to
know what happened to processor 22. The processor number keeps changing every
time I run the traffic.

My feeling is that processor 22 either goes into a deadlock state with
interrupts disabled, or it went into a deep idle sleep state.

If it's a deep idle state, an NMI should have recovered it.

But if it's a deadlock with interrupts disabled, then I need to know
the root cause of the deadlock.

root@A@0-1-1:~#
[ 1602.261011] Current::17 waiting::22 flag::4353
[ 1602.720488] Current::14 waiting::22 flag::3585
[ 1602.756404] Current::15 waiting::22 flag::3841
[ 1604.903248] Current::16 waiting::22 flag::4097
[ 1619.400917] Current::14 waiting::22 flag::3585
[ 1619.517870] Current::17 waiting::22 flag::4353
[ 1619.749893] Current::15 waiting::22 flag::3841
[ 1619.798453] Current::4 waiting::22 flag::1025
[ 1622.177171] Current::16 waiting::22 flag::4097
[ 1622.449412] INFO: rcu_preempt detected stalls on CPUs/tasks: { 22} (detected by 4, t=21008 jiffies, g=101651, c=101650, q=4713)
[ 1622.460951] Task dump for CPU 22:
[ 1622.464275] swapper/22  R  running task0 0  1 0x0800
[ 1622.471355] Call Trace:
[ 1622.473820] [c001f92b78d0] [009e] 0x9e (unreliable)
[ 1622.480119] [c001f92b7960] [c001f92b7aa0] 0xc001f92b7aa0
[ 1622.486497] [c001f92b79e0] [c0a90275] 0xc0a90275
[ 1622.492878] [c001f92b7a60] [c0006f64] .do_IRQ+0x184/0x370
[ 1622.499344] [c001f92b7b10] [c001b93c] exc_0x500_common+0xfc/0x100
[ 1622.506539] --- Exception: 501 at 0xc0a3e200
[ 1622.506539] LR = .__check_irq_replay+0x68/0x110
[ 1622.516388] [c001f92b7e00] [c00bc590] .cpu_startup_entry+0x1d0/0x350 (unreliable)
[ 1622.524950] [c001f92b7ed0] [c0a00460] .start_secondary+0x3ec/0x3f4
[ 1622.532197] [c001f92b7f90] [c36c] .start_secondary_prolog+0x10/0x14
[ 1635.237829] Current::3 waiting::5 flag::769
[ 1636.587746] Current::14 waiting::22 flag::3585
[ 1637.082117] Current::17 waiting::22 flag::4353
[ 1637.224920] Current::4 waiting::22 flag::1025
[ 1637.268789] Current::15 waiting::22 flag::3841
[ 1640.085792] Current::16 waiting::22 flag::4097
[ 1651.093367] Current::4 waiting::22 flag::1025
=

Regards,

Kiran

On Thu, May 5, 2016 at 4:40 PM, Arnd Bergmann  wrote:

> On Thursday 05 May 2016 09:41:32 Yangbo Lu wrote:
> > > -Original Message-
> > > From: Arnd Bergmann [mailto:a...@arndb.de]
> > > Sent: Thursday, May 05, 2016 4:32 PM
> > > To: linuxppc-dev@lists.ozlabs.org
> > > Cc: Yangbo Lu; linux-...@vger.kernel.org; devicet...@vger.kernel.org;
> > > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> > > linux-...@vger.kernel.org; linux-...@vger.kernel.org;
> iommu@lists.linux-
> > > foundation.org; net...@vger.kernel.org; Mark Rutland;
> > > ulf.hans...@linaro.org; Russell King; Bhupesh Sharma; Joerg Roedel;
> > > Santosh Shilimkar; Yang-Leo Li; Scott Wood; Rob Herring; Claudiu
> Manoil;
> > > Kumar Gala; Xiaobo Xie; Qiang Zhao
> > > Subject: Re: [v10, 7/7] mmc: sdhci-of-esdhc: fix host version for
> T4240-
> > > R1.0-R2.0
> > >
> > > On Thursday 05 May 2016 11:12:30 Yangbo Lu wrote:
> > > >
> > > > +   fsl_guts_init();
> > > > +   svr = fsl_guts_get_svr();
> > > > +   if (svr) {
> > > > +   esdhc->soc_ver = SVR_SOC_VER(svr);
> > > > +   esdhc->soc_rev = SVR_REV(svr);
> > > > +   } else {
> > > > +   dev_err(&pdev->dev, "Failed to get SVR value!\n");
> > > > +   }
> > > > +
> > > >
> > >
> > >
> > > Sorry for jumping in again after not participating in the discussion
> for
> > > the past few versions.
> > >
> > > What happened to my suggestion of making this a platform-independent
> > > interface to avoid the link time dependency?
> > >
> > > Specifically, why not add an exported function to drivers/base/soc.c
> that
> > > uses glob_match() for comparing a string in the device driver to the ID
> > > of the SoC that is set by whatever SoC identifying driver the platform
> > > has?
> >
> > [Lu Yangbo-B47093] I think this has been discussed in v6.
> > You can find Scott's c

Re: [RFC PATCH 4/5] dt-bindings: soc/fsl: Update reserved memory binding for QBMan

2017-04-03 Thread Rob Herring
On Wed, Mar 29, 2017 at 05:13:56PM -0400, Roy Pledge wrote:
> Updates the QMan and BMan device tree bindings for reserved memory
> nodes. This makes the reserved memory allocation compatiable with

s/compatiable/compatible/

> the shared-dma-pool usage.

This change is not backwards compatible. Please state that and explain 
why that is okay. If PPC needs to not change, then the old strings and 
properties should remain, but deprecated.

> 
> Signed-off-by: Roy Pledge 
> ---
>  Documentation/devicetree/bindings/soc/fsl/bman.txt | 10 +-
>  Documentation/devicetree/bindings/soc/fsl/qman.txt | 16 +---
>  2 files changed, 14 insertions(+), 12 deletions(-)


[PATCH v6 11/11] powerpc/perf: Thread imc cpuhotplug support

2017-04-03 Thread Madhavan Srinivasan
From: Anju T Sudhakar 

This patch adds support for thread IMC on cpu hotplug.

When a cpu goes offline, the LDBAR for that cpu is disabled, and when it comes
back online, the previous LDBAR value is written back to the LDBAR for that cpu.

To register the hotplug functions for thread_imc, a new state
CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE is added to the list of existing
states.

Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Anju T Sudhakar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/perf/imc-pmu.c | 32 +++-
 include/linux/cpuhotplug.h  |  1 +
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 3db637c6d3f4..944e568b52ff 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -726,6 +726,16 @@ static void cleanup_all_thread_imc_memory(void)
on_each_cpu(cleanup_thread_imc_memory, NULL, 1);
 }
 
+static void thread_imc_update_ldbar(unsigned int cpu_id)
+{
+   u64 ldbar_addr, ldbar_value;
+
+   ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);
+   ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
+   (u64)THREAD_IMC_ENABLE;
+   mtspr(SPRN_LDBAR, ldbar_value);
+}
+
 /*
  * Allocates a page of memory for each of the online cpus, and, writes the
  * physical base address of that page to the LDBAR for that cpu. This starts
@@ -733,21 +743,33 @@ static void cleanup_all_thread_imc_memory(void)
  */
 static void thread_imc_mem_alloc(void *dummy)
 {
-   u64 ldbar_addr, ldbar_value;
int cpu_id = smp_processor_id();
int phys_id = topology_physical_package_id(smp_processor_id());
 
per_cpu_add[cpu_id] = (u64)alloc_pages_exact_nid(phys_id,
(size_t)IMC_THREAD_COUNTER_MEM, GFP_KERNEL | 
__GFP_ZERO);
-   ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);
-   ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
-   (u64)THREAD_IMC_ENABLE;
-   mtspr(SPRN_LDBAR, ldbar_value);
+   thread_imc_update_ldbar(cpu_id);
+}
+
+static int ppc_thread_imc_cpu_online(unsigned int cpu)
+{
+   thread_imc_update_ldbar(cpu);
+   return 0;
 }
 
+static int ppc_thread_imc_cpu_offline(unsigned int cpu)
+{
+   mtspr(SPRN_LDBAR, 0);
+   return 0;
+ }
+
 void thread_imc_cpu_init(void)
 {
on_each_cpu(thread_imc_mem_alloc, NULL, 1);
+   cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE,
+   "POWER_THREAD_IMC_ONLINE",
+   ppc_thread_imc_cpu_online,
+   ppc_thread_imc_cpu_offline);
 }
 
 /*
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index abde85d9511a..724df46b2c3c 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -138,6 +138,7 @@ enum cpuhp_state {
CPUHP_AP_PERF_ARM_L2X0_ONLINE,
CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE,
+   CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE,
CPUHP_AP_PERF_ARM_QCOM_L2_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
-- 
2.7.4



[PATCH v6 10/11] powerpc/perf: Thread IMC PMU functions

2017-04-03 Thread Madhavan Srinivasan
From: Hemant Kumar 

This patch adds the PMU functions required for event initialization,
read, update, add, del etc. for thread IMC PMU. Thread IMC PMUs are used
for per-task monitoring. These PMUs don't need any hotplugging support.

For each CPU, a page of memory is allocated and is kept static i.e.,
these pages will exist till the machine shuts down. The base address of
this page is assigned to the ldbar of that cpu. As soon as we do that,
the thread IMC counters start running for that cpu, and the data of these
counters is accumulated in the page allocated. But we use this for
per-task monitoring. Whenever we start monitoring a task, the event is
added onto the task. At that point, we read the initial value of the
event. Whenever we stop monitoring the task, the final value is taken
and the difference is the event data.

Now, a task can move to a different cpu. Suppose a task X is moving from
cpu A to cpu B. When the task is scheduled out of A, we get an
event_del for A, and hence the event data is updated; we then stop
updating X's event data. As soon as X moves on to B, event_add is
called for B, and we again update the event data. This is how the event
data keeps getting updated even when the task is scheduled onto
different cpus.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h|   5 +
 arch/powerpc/perf/imc-pmu.c   | 173 +-
 arch/powerpc/platforms/powernv/opal-imc.c |   3 +
 3 files changed, 180 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index c63bc78fd6f6..42f0149886b4 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -25,12 +25,16 @@
 
 #define IMC_NEST_MAX_PAGES 16
 #define IMC_CORE_COUNTER_MEM   8192
+#define IMC_THREAD_COUNTER_MEM 8192
 
 #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
 #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest"
 #define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core"
 #define IMC_DTB_THREAD_COMPAT  "ibm,imc-counters-thread"
 
+#define THREAD_IMC_LDBAR_MASK   0x0003e000
+#define THREAD_IMC_ENABLE   0x8000
+
 /*
  * Structure to hold per chip specific memory address
  * information for nest pmus. Nest Counter data are exported
@@ -73,4 +77,5 @@ struct imc_pmu {
 
 int imc_get_domain(struct device_node *pmu_dev);
 void core_imc_disable(void);
+void thread_imc_disable(void);
 #endif /* PPC_POWERNV_IMC_PMU_DEF_H */
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 35b3564747e2..3db637c6d3f4 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -30,6 +30,9 @@ static u64 per_core_pdbar_add[IMC_MAX_CHIPS][IMC_MAX_CORES];
 static cpumask_t core_imc_cpumask;
 struct imc_pmu *core_imc_pmu;
 
+/* Maintains base address for all the cpus */
+static u64 per_cpu_add[NR_CPUS];
+
 /* Needed for sanity check */
 extern u64 nest_max_offset;
 extern u64 core_max_offset;
@@ -461,6 +464,56 @@ static int core_imc_event_init(struct perf_event *event)
return 0;
 }
 
+static int thread_imc_event_init(struct perf_event *event)
+{
+   struct task_struct *target;
+
+   if (event->attr.type != event->pmu->type)
+   return -ENOENT;
+
+   /* Sampling not supported */
+   if (event->hw.sample_period)
+   return -EINVAL;
+
+   event->hw.idx = -1;
+
+   /* Sanity check for config (event offset) */
+   if (event->attr.config > thread_max_offset)
+   return -EINVAL;
+
+   target = event->hw.target;
+
+   if (!target)
+   return -EINVAL;
+
+   event->pmu->task_ctx_nr = perf_sw_context;
+   return 0;
+}
+
+static void thread_imc_read_counter(struct perf_event *event)
+{
+   u64 *addr, data;
+   int cpu_id = smp_processor_id();
+
+   addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config);
+   data = __be64_to_cpu(READ_ONCE(*addr));
+   local64_set(&event->hw.prev_count, data);
+}
+
+static void thread_imc_perf_event_update(struct perf_event *event)
+{
+   u64 counter_prev, counter_new, final_count, *addr;
+   int cpu_id = smp_processor_id();
+
+   addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config);
+   counter_prev = local64_read(&event->hw.prev_count);
+   counter_new = __be64_to_cpu(READ_ONCE(*addr));
+   final_count = counter_new - counter_prev;
+
+   local64_set(&event->hw.prev_count, counter_new);
+   local64_add(final_count, &event->count);
+}
+
 static void imc_read_counter(struct perf_event *event)
 {
u64 *addr, data;
@@ -511,6 +564,53 @@ static int imc_event_add(struct perf_event *event, int 
flags)
return 0;
 }
 
+static void thread_imc_event_start(struct perf_event *event, int flags)
+{
+   thread_i

[PATCH v6 09/11] powerpc/powernv: Thread IMC events detection

2017-04-03 Thread Madhavan Srinivasan
From: Hemant Kumar 

This patch adds support for detection of thread IMC events. It adds a new
domain, IMC_DOMAIN_THREAD, which is determined with the help of the
compatibility string "ibm,imc-counters-thread" in the IMC device
tree.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h| 2 ++
 arch/powerpc/perf/imc-pmu.c   | 1 +
 arch/powerpc/platforms/powernv/opal-imc.c | 9 -
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index 4aa63191456a..c63bc78fd6f6 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -29,6 +29,7 @@
 #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
 #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest"
 #define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core"
+#define IMC_DTB_THREAD_COMPAT  "ibm,imc-counters-thread"
 
 /*
  * Structure to hold per chip specific memory address
@@ -67,6 +68,7 @@ struct imc_pmu {
  */
 #define IMC_DOMAIN_NEST1
 #define IMC_DOMAIN_CORE2
+#define IMC_DOMAIN_THREAD  3
 #define IMC_DOMAIN_UNKNOWN -1
 
 int imc_get_domain(struct device_node *pmu_dev);
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 45f9b35142a7..35b3564747e2 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -33,6 +33,7 @@ struct imc_pmu *core_imc_pmu;
 /* Needed for sanity check */
 extern u64 nest_max_offset;
 extern u64 core_max_offset;
+extern u64 thread_max_offset;
 
 PMU_FORMAT_ATTR(event, "config:0-20");
 static struct attribute *imc_format_attrs[] = {
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index f261fc933959..ac625cf13875 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -40,6 +40,7 @@ extern int init_imc_pmu(struct imc_events *events,
int idx, struct imc_pmu *pmu_ptr);
 u64 nest_max_offset;
 u64 core_max_offset;
+u64 thread_max_offset;
 
 static int imc_event_info(char *name, struct imc_events *events)
 {
@@ -87,6 +88,10 @@ static void update_max_value(u32 value, int pmu_domain)
if (core_max_offset < value)
core_max_offset = value;
break;
+   case IMC_DOMAIN_THREAD:
+   if (thread_max_offset < value)
+   thread_max_offset = value;
+   break;
default:
/* Unknown domain, return */
return;
@@ -240,6 +245,8 @@ int imc_get_domain(struct device_node *pmu_dev)
return IMC_DOMAIN_NEST;
if (of_device_is_compatible(pmu_dev, IMC_DTB_CORE_COMPAT))
return IMC_DOMAIN_CORE;
+   if (of_device_is_compatible(pmu_dev, IMC_DTB_THREAD_COMPAT))
+   return IMC_DOMAIN_THREAD;
else
return IMC_DOMAIN_UNKNOWN;
 }
@@ -278,7 +285,7 @@ static void imc_free_events(struct imc_events *events, int 
nr_entries)
 /*
  * imc_pmu_create : Takes the parent device which is the pmu unit and a
  *  pmu_index as the inputs.
- * Allocates memory for the pmu, sets up its domain (NEST or CORE), and
+ * Allocates memory for the pmu, sets up its domain (NEST/CORE/THREAD), and
  * allocates memory for the events supported by this pmu. Assigns a name for
  * the pmu. Calls imc_events_node_parser() to setup the individual events.
  * If everything goes fine, it calls, init_imc_pmu() to setup the pmu device
-- 
2.7.4



[PATCH v6 07/11] powerpc/powernv: Core IMC events detection

2017-04-03 Thread Madhavan Srinivasan
From: Hemant Kumar 

This patch adds support for detection of core IMC events along with the
nest IMC events. It adds a new domain IMC_DOMAIN_CORE, which is determined
with the help of the compatibility string "ibm,imc-counters-core" based
on the IMC device tree.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h|  2 ++
 arch/powerpc/perf/imc-pmu.c   |  3 +++
 arch/powerpc/platforms/powernv/opal-imc.c | 18 --
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index e13f51047522..2c39603ff3e7 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -26,6 +26,7 @@
 
 #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
 #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest"
+#define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core"
 
 /*
  * Structure to hold per chip specific memory address
@@ -63,6 +64,7 @@ struct imc_pmu {
  * Domains for IMC PMUs
  */
 #define IMC_DOMAIN_NEST1
+#define IMC_DOMAIN_CORE2
 #define IMC_DOMAIN_UNKNOWN -1
 
 int imc_get_domain(struct device_node *pmu_dev);
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index b28835b861b3..728c67e139e0 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -19,8 +19,11 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
 static cpumask_t nest_imc_cpumask;
 
+struct imc_pmu *core_imc_pmu;
+
 /* Needed for sanity check */
 extern u64 nest_max_offset;
+extern u64 core_max_offset;
 
 PMU_FORMAT_ATTR(event, "config:0-20");
 static struct attribute *imc_format_attrs[] = {
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index a98678266b0d..f6f63399ab06 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -34,10 +34,12 @@
 
 extern struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+extern struct imc_pmu *core_imc_pmu;
 
 extern int init_imc_pmu(struct imc_events *events,
int idx, struct imc_pmu *pmu_ptr);
 u64 nest_max_offset;
+u64 core_max_offset;
 
 static int imc_event_info(char *name, struct imc_events *events)
 {
@@ -81,6 +83,10 @@ static void update_max_value(u32 value, int pmu_domain)
if (nest_max_offset < value)
nest_max_offset = value;
break;
+   case IMC_DOMAIN_CORE:
+   if (core_max_offset < value)
+   core_max_offset = value;
+   break;
default:
/* Unknown domain, return */
return;
@@ -232,6 +238,8 @@ int imc_get_domain(struct device_node *pmu_dev)
 {
if (of_device_is_compatible(pmu_dev, IMC_DTB_NEST_COMPAT))
return IMC_DOMAIN_NEST;
+   if (of_device_is_compatible(pmu_dev, IMC_DTB_CORE_COMPAT))
+   return IMC_DOMAIN_CORE;
else
return IMC_DOMAIN_UNKNOWN;
 }
@@ -299,7 +307,10 @@ static int imc_pmu_create(struct device_node *parent, int 
pmu_index)
goto free_pmu;
 
/* Needed for hotplug/migration */
-   per_nest_pmu_arr[pmu_index] = pmu_ptr;
+   if (pmu_ptr->domain == IMC_DOMAIN_CORE)
+   core_imc_pmu = pmu_ptr;
+   else if (pmu_ptr->domain == IMC_DOMAIN_NEST)
+   per_nest_pmu_arr[pmu_index] = pmu_ptr;
 
/*
 * "events" property inside a PMU node contains the phandle value
@@ -355,7 +366,10 @@ static int imc_pmu_create(struct device_node *parent, int 
pmu_index)
}
 
/* Save the name to register it later */
-   sprintf(buf, "nest_%s", (char *)pp->value);
+   if (pmu_ptr->domain == IMC_DOMAIN_NEST)
+   sprintf(buf, "nest_%s", (char *)pp->value);
+   else
+   sprintf(buf, "%s_imc", (char *)pp->value);
pmu_ptr->pmu.name = (char *)buf;
 
/*
-- 
2.7.4



[PATCH v6 06/11] powerpc/perf: IMC pmu cpumask and cpu hotplug support

2017-04-03 Thread Madhavan Srinivasan
From: Hemant Kumar 

Adds a cpumask attribute to be used by each IMC PMU. For nest PMUs, only
one cpu (any online CPU) from each chip is designated to read the counters.

On CPU hotplug, the dying CPU is checked to see whether it is one of the
designated cpus; if yes, the next online cpu from the same chip (for nest
units) is designated as the new cpu to read the counters. For this purpose, we
introduce a new state: CPUHP_AP_PERF_POWERPC_NEST_ONLINE.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/opal-api.h|  13 ++-
 arch/powerpc/include/asm/opal.h|   3 +
 arch/powerpc/perf/imc-pmu.c| 155 -
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 include/linux/cpuhotplug.h |   1 +
 5 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index a0aa285869b5..23fc51e9d71d 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -168,7 +168,8 @@
 #define OPAL_INT_SET_MFRR  125
 #define OPAL_PCI_TCE_KILL  126
 #define OPAL_NMMU_SET_PTCR 127
-#define OPAL_LAST  127
+#define OPAL_NEST_IMC_COUNTERS_CONTROL 149
+#define OPAL_LAST  149
 
 /* Device tree flags */
 
@@ -928,6 +929,16 @@ enum {
OPAL_PCI_TCE_KILL_ALL,
 };
 
+/* Operation argument to OPAL_NEST_IMC_COUNTERS_CONTROL */
+enum {
+   OPAL_NEST_IMC_PRODUCTION_MODE = 1,
+};
+
+enum {
+   OPAL_NEST_IMC_STOP,
+   OPAL_NEST_IMC_START,
+};
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __OPAL_API_H */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 1ff03a6da76e..d93d08204243 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -227,6 +227,9 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t 
kill_type,
  uint64_t dma_addr, uint32_t npages);
 int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);
 
+int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
+   uint64_t value2, uint64_t value3);
+
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
   int depth, void *data);
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index bd5bf66bd920..b28835b861b3 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -17,6 +17,7 @@
 
 struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+static cpumask_t nest_imc_cpumask;
 
 /* Needed for sanity check */
 extern u64 nest_max_offset;
@@ -32,6 +33,152 @@ static struct attribute_group imc_format_group = {
.attrs = imc_format_attrs,
 };
 
+/* Get the cpumask printed to a buffer "buf" */
+static ssize_t imc_pmu_cpumask_get_attr(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   cpumask_t *active_mask;
+
+   active_mask = &nest_imc_cpumask;
+   return cpumap_print_to_pagebuf(true, buf, active_mask);
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, imc_pmu_cpumask_get_attr, NULL);
+
+static struct attribute *imc_pmu_cpumask_attrs[] = {
+   &dev_attr_cpumask.attr,
+   NULL,
+};
+
+static struct attribute_group imc_pmu_cpumask_attr_group = {
+   .attrs = imc_pmu_cpumask_attrs,
+};
+
+/*
+ * nest_init : Initializes the nest imc engine for the current chip.
+ */
+static void nest_init(int *loc)
+{
+   int rc;
+
+   rc = opal_nest_imc_counters_control(OPAL_NEST_IMC_PRODUCTION_MODE,
+   OPAL_NEST_IMC_START, 0, 0);
+   if (rc)
+   loc[smp_processor_id()] = 1;
+}
+
+static void nest_change_cpu_context(int old_cpu, int new_cpu)
+{
+   int i;
+
+   for (i = 0;
+(per_nest_pmu_arr[i] != NULL) && (i < IMC_MAX_PMUS); i++)
+   perf_pmu_migrate_context(&per_nest_pmu_arr[i]->pmu,
+   old_cpu, new_cpu);
+}
+
+static int ppc_nest_imc_cpu_online(unsigned int cpu)
+{
+   int nid;
+   const struct cpumask *l_cpumask;
+   struct cpumask tmp_mask;
+
+   /* Find the cpumask of this node */
+   nid = cpu_to_node(cpu);
+   l_cpumask = cpumask_of_node(nid);
+
+   /*
+* If any of the cpu from this node is already present in the mask,
+* just return, if not, then set this cpu in the mask.
+*/
+   if (!cpumask_and(&tmp_mask, l_cpumask, &nest_imc_cpumask)) {
+   cpumask_set_cpu(cpu, &nest_imc_cpumask);
+   return 0;
+   }
+
+   return 0;
+}
+
+static int ppc_nest_imc_cpu_offline(unsigned int cpu)
+{
+   int nid, target = -1;
+

[PATCH v6 08/11] powerpc/perf: PMU functions for Core IMC and hotplugging

2017-04-03 Thread Madhavan Srinivasan
From: Hemant Kumar 

This patch adds the PMU function to initialize a core IMC event. It also
adds a cpumask initialization function for the core IMC PMU. For
initialization, 8KB of memory is allocated per core, where the data
for the core IMC counters will be accumulated. The base address of this
page is sent to OPAL via an OPAL call, which initializes various SCOMs
related to core IMC initialization. Upon any error, the pages are
freed and the core IMC counters are disabled using the same OPAL call.

For CPU hotplugging, a cpumask is initialized which contains an online
CPU from each core. If a cpu goes offline, we check whether that cpu
belongs to the core IMC cpumask; if yes, we migrate the PMU context to
any other online cpu (if available) in that core. If a cpu comes back
online, it will be added to the core IMC cpumask only if there was no
other cpu from that core in the previous cpumask.

To register the hotplug functions for core_imc, a new state
CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE is added to the list of existing
states.

Patch also adds OPAL device shutdown callback. Needed to disable the
IMC core engine to handle kexec.

Signed-off-by: Hemant Kumar 
[Anju: Changed the condition for setting cpumask for core
in imc_pmu_cpumask_get_attr() ]
Signed-off-by: Anju T Sudhakar 

Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h |   3 +
 arch/powerpc/include/asm/opal-api.h|  10 +-
 arch/powerpc/include/asm/opal.h|   2 +
 arch/powerpc/perf/imc-pmu.c| 267 -
 arch/powerpc/platforms/powernv/opal-imc.c  |  13 +-
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 include/linux/cpuhotplug.h |   1 +
 7 files changed, 287 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index 2c39603ff3e7..4aa63191456a 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -21,8 +21,10 @@
 #define IMC_MAX_CHIPS  32
 #define IMC_MAX_PMUS   32
 #define IMC_MAX_PMU_NAME_LEN   256
+#define IMC_MAX_CORES  32
 
 #define IMC_NEST_MAX_PAGES 16
+#define IMC_CORE_COUNTER_MEM   8192
 
 #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
 #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest"
@@ -68,4 +70,5 @@ struct imc_pmu {
 #define IMC_DOMAIN_UNKNOWN -1
 
 int imc_get_domain(struct device_node *pmu_dev);
+void core_imc_disable(void);
 #endif /* PPC_POWERNV_IMC_PMU_DEF_H */
diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 23fc51e9d71d..971918deb793 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -169,7 +169,8 @@
 #define OPAL_PCI_TCE_KILL  126
 #define OPAL_NMMU_SET_PTCR 127
 #define OPAL_NEST_IMC_COUNTERS_CONTROL 149
-#define OPAL_LAST  149
+#define OPAL_CORE_IMC_COUNTERS_CONTROL 150
+#define OPAL_LAST  150
 
 /* Device tree flags */
 
@@ -939,6 +940,13 @@ enum {
OPAL_NEST_IMC_START,
 };
 
+/* Operation argument to Core IMC */
+enum {
+   OPAL_CORE_IMC_DISABLE,
+   OPAL_CORE_IMC_ENABLE,
+   OPAL_CORE_IMC_INIT,
+};
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __OPAL_API_H */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index d93d08204243..c4baa6d32037 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -229,6 +229,8 @@ int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);
 
 int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
uint64_t value2, uint64_t value3);
+int64_t opal_core_imc_counters_control(uint64_t operation, uint64_t addr,
+   uint64_t value2, uint64_t value3);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 728c67e139e0..45f9b35142a7 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1,5 +1,5 @@
 /*
- * Nest Performance Monitor counter support.
+ * IMC Performance Monitor counter support.
  *
  * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
  *  (C) 2016 Hemant K Shaw, IBM Corporation.
@@ -19,6 +19,15 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
 static cpumask_t nest_imc_cpumask;
 
+/*
+ * Maintains base addresses for all the cores.
+ * MAX chip and core are defined as 32. So we
+ * statically allocate 8K for this structure.
+ *
+ * TODO -- Could be made dynamic
+ */
+static u64 per_core_pdbar_add[IMC_MAX_CHIPS][IMC_MAX_CORES];
+static cpumask_t core_imc_cpumask;
 s

[PATCH v6 05/11] powerpc/perf: Generic imc pmu event functions

2017-04-03 Thread Madhavan Srinivasan
From: Hemant Kumar 

Since the IMC counters' data are periodically fed to a memory location,
the functions to read/update, start/stop and add/del can be generic and
can be used by all IMC PMU units.

This patch adds a set of generic imc pmu related event functions to be
used by each imc pmu unit. It also adds code to set up the format
attribute and to register the imc pmus, and an event_init function for
nest_imc events.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h|   1 +
 arch/powerpc/perf/imc-pmu.c   | 137 ++
 arch/powerpc/platforms/powernv/opal-imc.c |  30 ++-
 3 files changed, 164 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index a3d4f1bf9492..e13f51047522 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -65,4 +65,5 @@ struct imc_pmu {
 #define IMC_DOMAIN_NEST1
 #define IMC_DOMAIN_UNKNOWN -1
 
+int imc_get_domain(struct device_node *pmu_dev);
 #endif /* PPC_POWERNV_IMC_PMU_DEF_H */
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index ec464c76b749..bd5bf66bd920 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -18,6 +18,132 @@
 struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
 
+/* Needed for sanity check */
+extern u64 nest_max_offset;
+
+PMU_FORMAT_ATTR(event, "config:0-20");
+static struct attribute *imc_format_attrs[] = {
+   &format_attr_event.attr,
+   NULL,
+};
+
+static struct attribute_group imc_format_group = {
+   .name = "format",
+   .attrs = imc_format_attrs,
+};
+
+static int nest_imc_event_init(struct perf_event *event)
+{
+   int chip_id;
+   u32 config = event->attr.config;
+   struct perchip_nest_info *pcni;
+
+   if (event->attr.type != event->pmu->type)
+   return -ENOENT;
+
+   /* Sampling not supported */
+   if (event->hw.sample_period)
+   return -EINVAL;
+
+   /* unsupported modes and filters */
+   if (event->attr.exclude_user   ||
+   event->attr.exclude_kernel ||
+   event->attr.exclude_hv ||
+   event->attr.exclude_idle   ||
+   event->attr.exclude_host   ||
+   event->attr.exclude_guest)
+   return -EINVAL;
+
+   if (event->cpu < 0)
+   return -EINVAL;
+
+   /* Sanity check for config (event offset) */
+   if (config > nest_max_offset)
+   return -EINVAL;
+
+   chip_id = topology_physical_package_id(event->cpu);
+   pcni = &nest_perchip_info[chip_id];
+
+   /*
+* Memory for Nest HW counter data could be in multiple pages.
+* Hence check and pick the right base page from the event offset,
+* and then add to it.
+*/
+   event->hw.event_base = pcni->vbase[config/PAGE_SIZE] +
+   (config & ~PAGE_MASK);
+
+   return 0;
+}
+
+static void imc_read_counter(struct perf_event *event)
+{
+   u64 *addr, data;
+
+   addr = (u64 *)event->hw.event_base;
+   data = __be64_to_cpu(READ_ONCE(*addr));
+   local64_set(&event->hw.prev_count, data);
+}
+
+static void imc_perf_event_update(struct perf_event *event)
+{
+   u64 counter_prev, counter_new, final_count, *addr;
+
+   addr = (u64 *)event->hw.event_base;
+   counter_prev = local64_read(&event->hw.prev_count);
+   counter_new = __be64_to_cpu(READ_ONCE(*addr));
+   final_count = counter_new - counter_prev;
+
+   local64_set(&event->hw.prev_count, counter_new);
+   local64_add(final_count, &event->count);
+}
+
+static void imc_event_start(struct perf_event *event, int flags)
+{
+   /*
+* In Memory Counters are free flowing counters. HW or the microcode
+* keeps adding to the counter offset in memory. To get event
+* counter value, we snapshot the value here and we calculate
+* delta at later point.
+*/
+   imc_read_counter(event);
+}
+
+static void imc_event_stop(struct perf_event *event, int flags)
+{
+   /*
+* Take a snapshot and calculate the delta and update
+* the event counter values.
+*/
+   imc_perf_event_update(event);
+}
+
+static int imc_event_add(struct perf_event *event, int flags)
+{
+   if (flags & PERF_EF_START)
+   imc_event_start(event, flags);
+
+   return 0;
+}
+
+/* update_pmu_ops : Populate the appropriate operations for "pmu" */
+static int update_pmu_ops(struct imc_pmu *pmu)
+{
+   if (!pmu)
+   return -EINVAL;
+
+   pmu->pmu.task_ctx_nr = perf_invalid_context;
+   pmu->pmu.event_init = nest_imc_event_init;
+   pmu->pmu.add = imc_event_add;
+   pmu->pmu.del = imc_event_stop;
+   pmu->pmu.start = imc_event_start;

[PATCH v6 04/11] powerpc/perf: Add event attribute and group to IMC pmus

2017-04-03 Thread Madhavan Srinivasan
From: Hemant Kumar 

Device tree IMC driver code parses the IMC units and their events. It
passes the information to the IMC pmu code, which is placed in powerpc/perf
as "imc-pmu.c".

This patch creates only event attributes and attribute groups for the
IMC pmus.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/perf/Makefile|  6 +-
 arch/powerpc/perf/imc-pmu.c   | 97 +++
 arch/powerpc/platforms/powernv/opal-imc.c | 12 +++-
 3 files changed, 112 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/perf/imc-pmu.c

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 4d606b99a5cb..d0d1f04203c7 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -2,10 +2,14 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
 obj-$(CONFIG_PERF_EVENTS)  += callchain.o perf_regs.o
 
+imc-$(CONFIG_PPC_POWERNV)   += imc-pmu.o
+
 obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += power4-pmu.o ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
-  isa207-common.o power8-pmu.o power9-pmu.o
+  isa207-common.o power8-pmu.o power9-pmu.o \
+  $(imc-y)
+
 obj32-$(CONFIG_PPC_PERF_CTRS)  += mpc7450-pmu.o
 
 obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
new file mode 100644
index ..ec464c76b749
--- /dev/null
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -0,0 +1,97 @@
+/*
+ * Nest Performance Monitor counter support.
+ *
+ * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
+ *  (C) 2016 Hemant K Shaw, IBM Corporation.
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+
+/* dev_str_attr : Populate event "name" and string "str" in attribute */
+static struct attribute *dev_str_attr(const char *name, const char *str)
+{
+   struct perf_pmu_events_attr *attr;
+
+   attr = kzalloc(sizeof(*attr), GFP_KERNEL);
+
+   sysfs_attr_init(&attr->attr.attr);
+
+   attr->event_str = str;
+   attr->attr.attr.name = name;
+   attr->attr.attr.mode = 0444;
+   attr->attr.show = perf_event_sysfs_show;
+
+   return &attr->attr.attr;
+}
+
+/*
+ * update_events_in_group: Update the "events" information in an attr_group
+ * and assign the attr_group to the pmu "pmu".
+ */
+static int update_events_in_group(struct imc_events *events,
+ int idx, struct imc_pmu *pmu)
+{
+   struct attribute_group *attr_group;
+   struct attribute **attrs;
+   int i;
+
+   /* Allocate memory for attribute group */
+   attr_group = kzalloc(sizeof(*attr_group), GFP_KERNEL);
+   if (!attr_group)
+   return -ENOMEM;
+
+   /* Allocate memory for attributes */
+   attrs = kzalloc((sizeof(struct attribute *) * (idx + 1)), GFP_KERNEL);
+   if (!attrs) {
+   kfree(attr_group);
+   return -ENOMEM;
+   }
+
+   attr_group->name = "events";
+   attr_group->attrs = attrs;
+   for (i = 0; i < idx; i++, events++) {
+   attrs[i] = dev_str_attr((char *)events->ev_name,
+   (char *)events->ev_value);
+   }
+
+   pmu->attr_groups[0] = attr_group;
+   return 0;
+}
+
+/*
+ * init_imc_pmu : Setup the IMC pmu device in "pmu_ptr" and its events
+ *"events".
+ * Setup the cpu mask information for these pmus and setup the state machine
+ * hotplug notifiers as well.
+ */
+int init_imc_pmu(struct imc_events *events, int idx,
+struct imc_pmu *pmu_ptr)
+{
+   int ret = -ENODEV;
+
+   ret = update_events_in_group(events, idx, pmu_ptr);
+   if (ret)
+   goto err_free;
+
+   return 0;
+
+err_free:
+   /* Only free the attr_groups which are dynamically allocated  */
+   if (pmu_ptr->attr_groups[0]) {
+   kfree(pmu_ptr->attr_groups[0]->attrs);
+   kfree(pmu_ptr->attr_groups[0]);
+   }
+
+   return ret;
+}
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index ba0203e669c0..73c46703c2af 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -32,8 +32,11 @@
 #include 
 #include 
 
-struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
-struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+extern struct perchip_n

[PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and its events

2017-04-03 Thread Madhavan Srinivasan
From: Hemant Kumar 

Parse device tree to detect IMC units. Traverse through each IMC unit
node to find supported events and corresponding unit/scale files (if any).

Here is the DTS file for reference:


https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts

The device tree for IMC counters starts at the node "imc-counters".
This node contains all the IMC PMU nodes and event nodes
for these IMC PMUs. The PMU nodes have an "events" property which has a
phandle value for the actual events node. The events are separated from
the PMU nodes to abstract out the common events. For example, PMU nodes
"mcs0", "mcs1" etc. will contain a pointer to "nest-mcs-events", since
the events are common between these PMUs. These events have a different
prefix based on their relation to different PMUs, and hence, the PMU
nodes themselves contain an "events-prefix" property. The value of this
property, concatenated with the event name, forms the actual event
name. Also, the PMU nodes have a "reg" field as the base offset for the
events which belong to that PMU. This "reg" field is added to an event in
the "events" node, which gives us the location of the counter data. Kernel
code uses this offset as the event configuration value.

Device tree parser code also looks for the scale/unit properties in the
event node and passes the values on as event attributes for the perf
interface to use in post-processing by the perf tool. Some PMUs may have
common scale
and unit properties which implies that all events supported by this PMU
inherit the scale and unit properties of the PMU itself. For those
events, we need to set the common unit and scale values.

For failure to initialize any unit or any event, disable that unit and
continue setting up the rest of them.

Signed-off-by: Hemant Kumar 
Signed-off-by: Anju T Sudhakar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/platforms/powernv/opal-imc.c | 386 ++
 1 file changed, 386 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index c476d596c6a8..ba0203e669c0 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -33,6 +33,388 @@
 #include 
 
 struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+
+static int imc_event_info(char *name, struct imc_events *events)
+{
+   char *buf;
+
+   /* memory for content */
+   buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   events->ev_name = name;
+   events->ev_value = buf;
+   return 0;
+}
+
+static int imc_event_info_str(struct property *pp, char *name,
+  struct imc_events *events)
+{
+   int ret;
+
+   ret = imc_event_info(name, events);
+   if (ret)
+   return ret;
+
+   if (!pp->value || (strnlen(pp->value, pp->length) == pp->length) ||
+  (pp->length > IMC_MAX_PMU_NAME_LEN))
+   return -EINVAL;
+   strncpy(events->ev_value, (const char *)pp->value, pp->length);
+
+   return 0;
+}
+
+static int imc_event_info_val(char *name, u32 val,
+ struct imc_events *events)
+{
+   int ret;
+
+   ret = imc_event_info(name, events);
+   if (ret)
+   return ret;
+   sprintf(events->ev_value, "event=0x%x", val);
+
+   return 0;
+}
+
+static int set_event_property(struct property *pp, char *event_prop,
+ struct imc_events *events, char *ev_name)
+{
+   char *buf;
+   int ret;
+
+   buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   sprintf(buf, "%s.%s", ev_name, event_prop);
+   ret = imc_event_info_str(pp, buf, events);
+   if (ret) {
+   kfree(events->ev_name);
+   kfree(events->ev_value);
+   }
+
+   return ret;
+}
+
+/*
+ * imc_events_node_parser: Parse the event node "dev" and assign the parsed
+ * information to event "events".
+ *
+ * Parses the "reg" property of this event. "reg" gives us the event offset.
+ * Also, parse the "scale" and "unit" properties, if any.
+ */
+static int imc_events_node_parser(struct device_node *dev,
+ struct imc_events *events,
+ struct property *event_scale,
+ struct property *event_unit,
+ struct property *name_prefix,
+ u32 reg)
+{
+   struct property *name, *pp;
+   char *ev_name;
+   u32 val;
+   int idx = 0, ret;
+
+   if (!dev)
+   return -EINVAL;
+
+   /*
+* Loop through each property of an event node
+*/
+   name = of_find_property(dev, "event-name", NULL);
+   if (!name)
+   return -ENODEV;
+
+ 

[PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module

2017-04-03 Thread Madhavan Srinivasan
From: Hemant Kumar 

This patch does three things:
 - Enables "opal.c" to create a platform device for the IMC interface
   according to the appropriate compatibility string.
 - Finds the reserved-memory region details from the system device tree
   and gets the base address of the HOMER (reserved memory) region for
   each chip.
 - Gets the Nest PMU counter data offsets (in the HOMER region)
   and their sizes. The offsets for the counters' data are fixed and
   won't change from chip to chip.

The device tree parsing logic is separated from the PMU creation
functions (which is done in subsequent patches).

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/platforms/powernv/Makefile   |   2 +-
 arch/powerpc/platforms/powernv/opal-imc.c | 126 ++
 arch/powerpc/platforms/powernv/opal.c |  14 
 3 files changed, 141 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/platforms/powernv/opal-imc.c

diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index b5d98cb3f482..44909fec1121 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -2,7 +2,7 @@ obj-y   += setup.o opal-wrappers.o opal.o 
opal-async.o idle.o
 obj-y  += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y  += rng.o opal-elog.o opal-dump.o opal-sysparam.o 
opal-sensor.o
 obj-y  += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
-obj-y  += opal-kmsg.o
+obj-y  += opal-kmsg.o opal-imc.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
new file mode 100644
index ..c476d596c6a8
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -0,0 +1,126 @@
+/*
+ * OPAL IMC interface detection driver
+ * Supported on POWERNV platform
+ *
+ * Copyright   (C) 2016 Madhavan Srinivasan, IBM Corporation.
+ * (C) 2016 Hemant K Shaw, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+
+static int opal_imc_counters_probe(struct platform_device *pdev)
+{
+   struct device_node *child, *imc_dev, *rm_node = NULL;
+   struct perchip_nest_info *pcni;
+   u32 pages, nest_offset, nest_size, idx;
+   int i = 0;
+   const char *node_name;
+   const __be32 *addrp;
+   u64 reg_addr, reg_size;
+
+   if (!pdev || !pdev->dev.of_node)
+   return -ENODEV;
+
+   /*
+* Check whether this is a kdump kernel. If so, just return.
+*/
+   if (is_kdump_kernel())
+   return -ENODEV;
+
+   imc_dev = pdev->dev.of_node;
+
+   /*
+* nest_offset : where the nest-counters' data start.
+* size : size of the entire nest-counters region
+*/
+   if (of_property_read_u32(imc_dev, "imc-nest-offset", &nest_offset))
+   goto err;
+
+   if (of_property_read_u32(imc_dev, "imc-nest-size", &nest_size))
+   goto err;
+
+   /* Find the "homer region" for each chip */
+   rm_node = of_find_node_by_path("/reserved-memory");
+   if (!rm_node)
+   goto err;
+
+   for_each_child_of_node(rm_node, child) {
+   if (of_property_read_string_index(child, "name", 0,
+ &node_name))
+   continue;
+   if (strncmp("ibm,homer-image", node_name,
+   strlen("ibm,homer-image")))
+   continue;
+
+   /* Get the chip id to which the above homer region belongs to */
+   if (of_property_read_u32(child, "ibm,chip-id", &idx))
+   goto err;
+
+   pcni = &nest_perchip_info[idx];
+   addrp = of_get_address(child, 0, ®_size, NULL);
+   if (!addrp)
+   goto err;
+
+   /* Fetch the homer region base address */
+   reg_addr = of_read_number(addrp, 2);
+   pcni->pbase = reg_addr;
+   /* Add the nest IMC Base offset */
+   pcni->pbase =

[PATCH v6 01/11] powerpc/powernv: Data structure and macros definitions

2017-04-03 Thread Madhavan Srinivasan
From: Hemant Kumar 

Create new header file "imc-pmu.h" to add the data structures
and macros needed for IMC pmu support.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h | 68 ++
 1 file changed, 68 insertions(+)
 create mode 100644 arch/powerpc/include/asm/imc-pmu.h

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
new file mode 100644
index ..a3d4f1bf9492
--- /dev/null
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -0,0 +1,68 @@
+#ifndef PPC_POWERNV_IMC_PMU_DEF_H
+#define PPC_POWERNV_IMC_PMU_DEF_H
+
+/*
+ * IMC Nest Performance Monitor counter support.
+ *
+ * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
+ *   (C) 2016 Hemant K Shaw, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define IMC_MAX_CHIPS  32
+#define IMC_MAX_PMUS   32
+#define IMC_MAX_PMU_NAME_LEN   256
+
+#define IMC_NEST_MAX_PAGES 16
+
+#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
+#define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest"
+
+/*
+ * Structure to hold per chip specific memory address
+ * information for nest pmus. Nest Counter data are exported
+ * in per-chip reserved memory region by the PORE Engine.
+ */
+struct perchip_nest_info {
+   u32 chip_id;
+   u64 pbase;
+   u64 vbase[IMC_NEST_MAX_PAGES];
+   u64 size;
+};
+
+/*
+ * Place holder for nest pmu events and values.
+ */
+struct imc_events {
+   char *ev_name;
+   char *ev_value;
+};
+
+/*
+ * Device tree parser code detects IMC pmu support and
+ * registers new IMC pmus. This structure will
+ * hold the pmu functions and attrs for each imc pmu and
+ * will be referenced at the time of pmu registration.
+ */
+struct imc_pmu {
+   struct pmu pmu;
+   int domain;
+   const struct attribute_group *attr_groups[4];
+};
+
+/*
+ * Domains for IMC PMUs
+ */
+#define IMC_DOMAIN_NEST1
+#define IMC_DOMAIN_UNKNOWN -1
+
+#endif /* PPC_POWERNV_IMC_PMU_DEF_H */
-- 
2.7.4



[PATCH v6 00/11] IMC Instrumentation Support

2017-04-03 Thread Madhavan Srinivasan
Power9 has In-Memory-Collection (IMC) infrastructure which contains
various Performance Monitoring Units (PMUs) at Nest level (these are
on-chip but off-core), Core level and Thread level.

The Nest PMU counters are handled by a Nest IMC microcode which runs
in the OCC (On-Chip Controller) complex. The microcode collects the
counter data and moves the nest IMC counter data to memory.

The Core and Thread IMC PMU counters are handled in the core. Core
level PMU counters give us the IMC counters' data per core and thread
level PMU counters give us the IMC counters' data per CPU thread.

This patchset enables the nest IMC, core IMC and thread IMC
PMUs and is based on the initial work done by Madhavan Srinivasan.
"Nest Instrumentation Support" :
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132078.html

v1 for this patchset can be found here :
https://lwn.net/Articles/705475/

Nest events:
Per-chip nest instrumentation provides various per-chip metrics
such as memory, powerbus, Xlink and Alink bandwidth.

Core events:
Per-core IMC instrumentation provides various per-core metrics
such as non-idle cycles, non-idle instructions, various cache and
memory related metrics etc.

Thread events:
All the events for thread level are same as core level with the
difference being in the domain. These are per-cpu metrics.

PMU Events' Information:
OPAL obtains the IMC PMU and event information from the IMC Catalog
and passes on to the kernel via the device tree. The events' information
contains :
 - Event name
 - Event Offset
 - Event description
and, maybe :
 - Event scale
 - Event unit

Some PMUs may have a common scale and unit values for all their
supported events. For those cases, the scale and unit properties for
those events must be inherited from the PMU.

The event offset in the memory is where the counter data gets
accumulated.

The OPAL-side patches are posted upstream :
https://lists.ozlabs.org/pipermail/skiboot/2017-March/006531.html

The kernel discovers the IMC counters information in the device tree
at the "imc-counters" device node which has a compatible field
"ibm,opal-in-memory-counters".

Parsing of the Events' information:
To parse the IMC PMUs and events information, the kernel has to
discover the "imc-counters" node and walk through the pmu and event
nodes.

Here is an excerpt of the dt showing the imc-counters with
mcs0 (nest), core and thread node:

https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts

/dts-v1/;

[...]

/ {
name = "";
compatible = "ibm,opal-in-memory-counters";
#address-cells = <0x1>;
#size-cells = <0x1>;
imc-nest-offset = <0x32>;
imc-nest-size = <0x3>;
version-id = "";

NEST_MCS: nest-mcs-events {
#address-cells = <0x1>;
#size-cells = <0x1>;

event@0 {
event-name = "RRTO_QFULL_NO_DISP" ;
reg = <0x0 0x8>;
desc = "RRTO not dispatched in MCS0 due to capacity - 
pulses once for each time a valid RRTO op is not dispatched due to a command 
list full condition" ;
};
event@8 {
event-name = "WRTO_QFULL_NO_DISP" ;
reg = <0x8 0x8>;
desc = "WRTO not dispatched in MCS0 due to capacity - 
pulses once for each time a valid WRTO op is not dispatched due to a command 
list full condition" ;
};
[...]
mcs0 {
compatible = "ibm,imc-counters-nest";
events-prefix = "PM_MCS0_";
unit = "";
scale = "";
reg = <0x118 0x8>;
events = < &NEST_MCS >;
};

mcs1 {
compatible = "ibm,imc-counters-nest";
events-prefix = "PM_MCS1_";
unit = "";
scale = "";
reg = <0x198 0x8>;
events = < &NEST_MCS >;
};
[...]

CORE_EVENTS: core-events {
#address-cells = <0x1>;
#size-cells = <0x1>;

event@e0 {
event-name = "0THRD_NON_IDLE_PCYC" ;
reg = <0xe0 0x8>;
desc = "The number of processor cycles when all threads 
are idle" ;
};
event@120 {
event-name = "1THRD_NON_IDLE_PCYC" ;
reg = <0x120 0x8>;
desc = "The number of processor cycles when exactly one 
SMT thread is executing non-idle code" ;
};
[...]
   core {
compatible = "ibm,imc-counters-core";
events-prefix = "CPM_";
unit = "";
scale = "";
reg = <0x0 0x8>;
events = < &CORE_EVENTS >;
};

thread {
c

Re: [RFC PATCH 2/5] soc/fsl/qbman: Use shared-dma-pool for QMan private memory allocations

2017-04-03 Thread Robin Murphy
On 01/04/17 08:25, Scott Wood wrote:
> On Fri, 2017-03-31 at 18:55 +0100, Robin Murphy wrote:
>> On 31/03/17 04:27, Michael Ellerman wrote:
>>>
>>> Robin Murphy  writes:
>>>

 Hi Roy,

 On 29/03/17 22:13, Roy Pledge wrote:
>
> Use the shared-memory-pool mechanism for frame queue descriptor and
> packed frame descriptor record area allocations.
 Thanks for persevering with this - in my opinion it's now looking like
 it was worth the effort :)

 AFAICS the ioremap_wc() that this leads to does appear to give back
 something non-cacheable on PPC (assuming "pgprot_noncached_wc" isn't
 horrendously misnamed), and "no-map" should rule out any cacheable
 linear map alias existing, so it would seem that this approach should
 avert Scott's concerns about attribute mismatches.
>>> How does 'no-map' translate into something being excluded from the
>>> linear mapping?
>> Reserved regions marked with "no-map" get memblock_remove()d by
>> early_init_dt_alloc_reserved_memory_arch(). As I understand things, the
>> linear map should only cover memblock areas, and it would be explicitly
>> violating the semantics of "no-map" to still cover such a region.
> 
> Discontiguous memory isn't supported on these PPC chips.  Everything up to
> memblock_end_of_DRAM() gets mapped -- and if that were to change, the
> fragmentation would waste TLB1 entries.

Ah, so the "PPC-specific angles I'm not aware of" category is indeed
non-empty - I guess the lack of HAVE_GENERIC_DMA_COHERENT might be
related, then.

That said, though, AFAICS only certain x86 and s390 configurations ever
call memblock_set_bottom_up(true), so we should be able to assume that
the reserved region allocations always fall through to
__memblock_find_range_top_down(). Thus if your DRAM is contiguous, then
"no-map"-ing the reserved regions will simply end up pushing
memblock_end_of_DRAM() down in a manner that would appear to still avoid
overlaps. I can only see that going wrong if the end of DRAM wasn't at
least 32MB aligned to begin with - is that ever likely to happen in
practice?

Robin.

> This also breaks compatibility with existing device trees.  I suggest putting
> an ifdef in the qbman driver to add the new scheme for non-PPC arches only.
> 
> -Scott
> 



Re: [PATCH] powerpc/misc: fix exported functions that reference the TOC

2017-04-03 Thread Michael Ellerman
Oliver O'Halloran  writes:

> When the kernel is compiled to use 64bit ABIv2 the _GLOBAL() macro does not
> include a global entry point. A function's global entry point is used when the
> function is called from a different TOC context, and in the kernel this
> typically means a call from a module into the vmlinux (or vice versa).
>
> There are a few exported ASM functions declared with _GLOBAL(), and calling
> them from a module will likely crash the kernel since any TOC-relative
> load will yield garbage.
>
> To fix this, use _GLOBAL_TOC() for exported asm functions rather than
> _GLOBAL(), and add some documentation about when to use each.

I wonder if we should just change _GLOBAL() to include the global entry
point. Persisting with _GLOBAL_TOC() seems like it's just going to be a
game of whack-a-mole.

Just grepping now I see ~50 functions defined with _GLOBAL() which are
also EXPORT'ed. Now presumably none of them use the TOC? But there's no
way to verify it, other than inspecting each one, and there's no way to
ensure we don't break it again in future.

The other option would be just to make a rule that anything EXPORT'ed
must use _GLOBAL_TOC().

cheers


Re: [PATCH V3 7/7] cxl: Add psl9 specific code

2017-04-03 Thread Frederic Barrat



Le 28/03/2017 à 17:14, Christophe Lombard a écrit :

The new Coherent Accelerator Interface Architecture, level 2, for the
IBM POWER9 brings new content and features:
- POWER9 Service Layer
- Registers
- Radix mode
- Process element entry
- Dedicated-Shared Process Programming Model
- Translation Fault Handling
- CAPP
- Memory Context ID
If a valid mm_struct is found, the memory context id is used for each
transaction associated with the process handle. The PSL uses the
context ID to find the corresponding process element.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/context.c |  16 ++-
 drivers/misc/cxl/cxl.h | 137 +
 drivers/misc/cxl/debugfs.c |  19 
 drivers/misc/cxl/fault.c   |  78 --
 drivers/misc/cxl/guest.c   |   8 +-
 drivers/misc/cxl/irq.c |  53 ++
 drivers/misc/cxl/native.c  | 247 -
 drivers/misc/cxl/pci.c | 246 +---
 drivers/misc/cxl/trace.h   |  43 
 9 files changed, 752 insertions(+), 95 deletions(-)

diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index ac2531e..45363be 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -188,12 +188,24 @@ int cxl_context_iomap(struct cxl_context *ctx, struct 
vm_area_struct *vma)
if (ctx->afu->current_mode == CXL_MODE_DEDICATED) {
if (start + len > ctx->afu->adapter->ps_size)
return -EINVAL;
+
+   if (cxl_is_psl9(ctx->afu)) {
+   /* make sure there is a valid problem state
+* area space for this AFU
+*/
+   if (ctx->master && !ctx->afu->psa) {
+   pr_devel("AFU doesn't support mmio space\n");
+   return -EINVAL;
+   }
+
+   /* Can't mmap until the AFU is enabled */
+   if (!ctx->afu->enabled)
+   return -EBUSY;
+   }
} else {
if (start + len > ctx->psn_size)
return -EINVAL;
-   }

-   if (ctx->afu->current_mode != CXL_MODE_DEDICATED) {
/* make sure there is a valid per process space for this AFU */
if ((ctx->master && !ctx->afu->psa) || (!ctx->afu->pp_psa)) {
pr_devel("AFU doesn't support mmio space\n");
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 2f2d9e4..aac0504 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -63,7 +63,7 @@ typedef struct {
 /* Memory maps. Ref CXL Appendix A */

 /* PSL Privilege 1 Memory Map */
-/* Configuration and Control area */
+/* Configuration and Control area - CAIA 1&2 */
 static const cxl_p1_reg_t CXL_PSL_CtxTime = {0x};
 static const cxl_p1_reg_t CXL_PSL_ErrIVTE = {0x0008};
 static const cxl_p1_reg_t CXL_PSL_KEY1= {0x0010};
@@ -98,11 +98,29 @@ static const cxl_p1_reg_t CXL_XSL_Timebase  = {0x0100};
 static const cxl_p1_reg_t CXL_XSL_TB_CTLSTAT = {0x0108};
 static const cxl_p1_reg_t CXL_XSL_FEC   = {0x0158};
 static const cxl_p1_reg_t CXL_XSL_DSNCTL= {0x0168};
+/* PSL registers - CAIA 2 */
+static const cxl_p1_reg_t CXL_PSL9_CONTROL  = {0x0020};
+static const cxl_p1_reg_t CXL_XSL9_DSNCTL   = {0x0168};
+static const cxl_p1_reg_t CXL_PSL9_FIR1 = {0x0300};
+static const cxl_p1_reg_t CXL_PSL9_FIR2 = {0x0308};
+static const cxl_p1_reg_t CXL_PSL9_Timebase = {0x0310};
+static const cxl_p1_reg_t CXL_PSL9_DEBUG= {0x0320};
+static const cxl_p1_reg_t CXL_PSL9_FIR_CNTL = {0x0348};
+static const cxl_p1_reg_t CXL_PSL9_DSNDCTL  = {0x0350};
+static const cxl_p1_reg_t CXL_PSL9_TB_CTLSTAT = {0x0340};
+static const cxl_p1_reg_t CXL_PSL9_TRACECFG = {0x0368};
+static const cxl_p1_reg_t CXL_PSL9_APCDEDALLOC = {0x0378};
+static const cxl_p1_reg_t CXL_PSL9_APCDEDTYPE = {0x0380};
+static const cxl_p1_reg_t CXL_PSL9_TNR_ADDR = {0x0388};
+static const cxl_p1_reg_t CXL_PSL9_GP_CT = {0x0398};
+static const cxl_p1_reg_t CXL_XSL9_IERAT = {0x0588};
+static const cxl_p1_reg_t CXL_XSL9_ILPP  = {0x0590};
+
 /* 0x7F00:7FFF Reserved PCIe MSI-X Pending Bit Array area */
 /* 0x8000: Reserved PCIe MSI-X Table Area */

 /* PSL Slice Privilege 1 Memory Map */
-/* Configuration Area */
+/* Configuration Area - CAIA 1&2 */
 static const cxl_p1n_reg_t CXL_PSL_SR_An  = {0x00};
 static const cxl_p1n_reg_t CXL_PSL_LPID_An= {0x08};
 static const cxl_p1n_reg_t CXL_PSL_AMBAR_An   = {0x10};
@@ -111,17 +129,18 @@ static const cxl_p1n_reg_t CXL_PSL_ID_An  = 
{0x20};
 static const cxl_p1n_reg_t CXL_PSL_SERR_An= {0x28};
 /* Memory Management and Lookaside Buffer Management - CAIA 1*/
 static const cxl_p1n_reg_t CXL_PSL_SDR_An = {0x30};
+/* Memory Management and Lookaside Buffer Management - CAIA 1&2 */
 static const cxl_p1n_reg_t CXL_PSL_A

Re: [PATCH V3 6/7] cxl: Isolate few psl8 specific calls

2017-04-03 Thread Frederic Barrat



On 28/03/2017 at 17:14, Christophe Lombard wrote:

Point out the specific Coherent Accelerator Interface Architecture,
level 1, registers.
Code and functions specific to PSL8 (CAIA1) must be clearly isolated.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/context.c | 28 +++-
 drivers/misc/cxl/cxl.h | 35 +++--
 drivers/misc/cxl/debugfs.c |  6 +++--
 drivers/misc/cxl/fault.c   | 14 +-
 drivers/misc/cxl/native.c  | 58 ++---
 drivers/misc/cxl/pci.c | 64 +++---
 6 files changed, 136 insertions(+), 69 deletions(-)

diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 2e935ea..ac2531e 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -39,23 +39,26 @@ int cxl_context_init(struct cxl_context *ctx, struct 
cxl_afu *afu, bool master)
 {
int i;

-   spin_lock_init(&ctx->sste_lock);
ctx->afu = afu;
ctx->master = master;
ctx->pid = NULL; /* Set in start work ioctl */
mutex_init(&ctx->mapping_lock);
ctx->mapping = NULL;

-   /*
-* Allocate the segment table before we put it in the IDR so that we
-* can always access it when dereferenced from IDR. For the same
-* reason, the segment table is only destroyed after the context is
-* removed from the IDR.  Access to this in the IOCTL is protected by
-* Linux filesytem symantics (can't IOCTL until open is complete).
-*/
-   i = cxl_alloc_sst(ctx);
-   if (i)
-   return i;
+   if (cxl_is_psl8(afu)) {
+   spin_lock_init(&ctx->sste_lock);
+
+   /*
+* Allocate the segment table before we put it in the IDR so 
that we
+* can always access it when dereferenced from IDR. For the same
+* reason, the segment table is only destroyed after the 
context is
+* removed from the IDR.  Access to this in the IOCTL is 
protected by
+* Linux filesytem symantics (can't IOCTL until open is 
complete).
+*/
+   i = cxl_alloc_sst(ctx);
+   if (i)
+   return i;
+   }

INIT_WORK(&ctx->fault_work, cxl_handle_fault);

@@ -308,7 +311,8 @@ static void reclaim_ctx(struct rcu_head *rcu)
 {
struct cxl_context *ctx = container_of(rcu, struct cxl_context, rcu);

-   free_page((u64)ctx->sstp);
+   if (cxl_is_psl8(ctx->afu))
+   free_page((u64)ctx->sstp);
if (ctx->ff_page)
__free_page(ctx->ff_page);
ctx->sstp = NULL;
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index c7dd315..2f2d9e4 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h



This is very minor, but a lot of the changes in the comments below could 
probably be merged in the next patch when the CAIA2 version of those 
registers is introduced. No big deal though.




@@ -73,7 +73,7 @@ static const cxl_p1_reg_t CXL_PSL_Control = {0x0020};
 static const cxl_p1_reg_t CXL_PSL_DLCNTL  = {0x0060};
 static const cxl_p1_reg_t CXL_PSL_DLADDR  = {0x0068};

-/* PSL Lookaside Buffer Management Area */
+/* PSL Lookaside Buffer Management Area - CAIA 1 */
 static const cxl_p1_reg_t CXL_PSL_LBISEL  = {0x0080};
 static const cxl_p1_reg_t CXL_PSL_SLBIE   = {0x0088};
 static const cxl_p1_reg_t CXL_PSL_SLBIA   = {0x0090};
@@ -82,7 +82,7 @@ static const cxl_p1_reg_t CXL_PSL_TLBIA   = {0x00A8};
 static const cxl_p1_reg_t CXL_PSL_AFUSEL  = {0x00B0};

 /* 0x00C0:7EFF Implementation dependent area */
-/* PSL registers */
+/* PSL registers - CAIA 1 */
 static const cxl_p1_reg_t CXL_PSL_FIR1  = {0x0100};
 static const cxl_p1_reg_t CXL_PSL_FIR2  = {0x0108};
 static const cxl_p1_reg_t CXL_PSL_Timebase  = {0x0110};
@@ -109,7 +109,7 @@ static const cxl_p1n_reg_t CXL_PSL_AMBAR_An   = {0x10};
 static const cxl_p1n_reg_t CXL_PSL_SPOffset_An= {0x18};
 static const cxl_p1n_reg_t CXL_PSL_ID_An  = {0x20};
 static const cxl_p1n_reg_t CXL_PSL_SERR_An= {0x28};
-/* Memory Management and Lookaside Buffer Management */
+/* Memory Management and Lookaside Buffer Management - CAIA 1*/
 static const cxl_p1n_reg_t CXL_PSL_SDR_An = {0x30};
 static const cxl_p1n_reg_t CXL_PSL_AMOR_An= {0x38};
 /* Pointer Area */
@@ -124,6 +124,7 @@ static const cxl_p1n_reg_t CXL_PSL_IVTE_Limit_An  = {0xB8};
 /* 0xC0:FF Implementation Dependent Area */
 static const cxl_p1n_reg_t CXL_PSL_FIR_SLICE_An   = {0xC0};
 static const cxl_p1n_reg_t CXL_AFU_DEBUG_An   = {0xC8};
+/* 0xC0:FF Implementation Dependent Area - CAIA 1 */
 static const cxl_p1n_reg_t CXL_PSL_APCALLOC_A = {0xD0};
 static const cxl_p1n_reg_t CXL_PSL_COALLOC_A  = {0xD8};
 static const cxl_p1n_reg_t CXL_PSL_RXCTL_A= {0xE0};
@@ -133,12 +134,14 @@ static const cxl_p1n_reg_t CXL_PSL_SLICE_TRACE= 
{0xE8};
 /* Configuration and 

Re: [PATCH V3 5/7] cxl: Rename some psl8 specific functions

2017-04-03 Thread Frederic Barrat



On 28/03/2017 at 17:14, Christophe Lombard wrote:

Rename a few functions, changing the '_psl' suffix to '_psl8', to make
clear that the implementation is psl8 specific.
Those functions will have an equivalent implementation for the psl9 in
a later patch.

Signed-off-by: Christophe Lombard 
---



Acked-by: Frederic Barrat 





 drivers/misc/cxl/cxl.h | 26 +-
 drivers/misc/cxl/debugfs.c |  6 +++---
 drivers/misc/cxl/guest.c   |  2 +-
 drivers/misc/cxl/irq.c |  2 +-
 drivers/misc/cxl/native.c  | 12 ++--
 drivers/misc/cxl/pci.c | 46 +++---
 6 files changed, 47 insertions(+), 47 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 65d9fb6..c7dd315 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -813,10 +813,10 @@ int afu_register_irqs(struct cxl_context *ctx, u32 count);
 void afu_release_irqs(struct cxl_context *ctx, void *cookie);
 void afu_irq_name_free(struct cxl_context *ctx);

-int cxl_attach_afu_directed_psl(struct cxl_context *ctx, u64 wed, u64 amr);
-int cxl_activate_dedicated_process_psl(struct cxl_afu *afu);
-int cxl_attach_dedicated_process_psl(struct cxl_context *ctx, u64 wed, u64 
amr);
-void cxl_update_dedicated_ivtes_psl(struct cxl_context *ctx);
+int cxl_attach_afu_directed_psl8(struct cxl_context *ctx, u64 wed, u64 amr);
+int cxl_activate_dedicated_process_psl8(struct cxl_afu *afu);
+int cxl_attach_dedicated_process_psl8(struct cxl_context *ctx, u64 wed, u64 
amr);
+void cxl_update_dedicated_ivtes_psl8(struct cxl_context *ctx);

 #ifdef CONFIG_DEBUG_FS

@@ -826,10 +826,10 @@ int cxl_debugfs_adapter_add(struct cxl *adapter);
 void cxl_debugfs_adapter_remove(struct cxl *adapter);
 int cxl_debugfs_afu_add(struct cxl_afu *afu);
 void cxl_debugfs_afu_remove(struct cxl_afu *afu);
-void cxl_stop_trace_psl(struct cxl *cxl);
-void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir);
+void cxl_stop_trace_psl8(struct cxl *cxl);
+void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry 
*dir);
 void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir);
-void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir);
+void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir);

 #else /* CONFIG_DEBUG_FS */

@@ -860,11 +860,11 @@ static inline void cxl_debugfs_afu_remove(struct cxl_afu 
*afu)
 {
 }

-static inline void cxl_stop_trace(struct cxl *cxl)
+static inline void cxl_stop_trace_psl8(struct cxl *cxl)
 {
 }

-static inline void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter,
+static inline void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter,
struct dentry *dir)
 {
 }
@@ -874,7 +874,7 @@ static inline void cxl_debugfs_add_adapter_regs_xsl(struct 
cxl *adapter,
 {
 }

-static inline void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct 
dentry *dir)
+static inline void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct 
dentry *dir)
 {
 }

@@ -932,8 +932,8 @@ struct cxl_irq_info {
 };

 void cxl_assign_psn_space(struct cxl_context *ctx);
-int cxl_invalidate_all_psl(struct cxl *adapter);
-irqreturn_t cxl_irq_psl(int irq, struct cxl_context *ctx, struct cxl_irq_info 
*irq_info);
+int cxl_invalidate_all_psl8(struct cxl *adapter);
+irqreturn_t cxl_irq_psl8(int irq, struct cxl_context *ctx, struct cxl_irq_info 
*irq_info);
 irqreturn_t cxl_fail_irq_psl(struct cxl_afu *afu, struct cxl_irq_info 
*irq_info);
 int cxl_register_one_irq(struct cxl *adapter, irq_handler_t handler,
void *cookie, irq_hw_number_t *dest_hwirq,
@@ -945,7 +945,7 @@ int cxl_data_cache_flush(struct cxl *adapter);
 int cxl_afu_disable(struct cxl_afu *afu);
 int cxl_psl_purge(struct cxl_afu *afu);

-void cxl_native_irq_dump_regs_psl(struct cxl_context *ctx);
+void cxl_native_irq_dump_regs_psl8(struct cxl_context *ctx);
 void cxl_native_err_irq_dump_regs(struct cxl *adapter);
 int cxl_pci_vphb_add(struct cxl_afu *afu);
 void cxl_pci_vphb_remove(struct cxl_afu *afu);
diff --git a/drivers/misc/cxl/debugfs.c b/drivers/misc/cxl/debugfs.c
index 4848ebf..2ff10a9 100644
--- a/drivers/misc/cxl/debugfs.c
+++ b/drivers/misc/cxl/debugfs.c
@@ -15,7 +15,7 @@

 static struct dentry *cxl_debugfs;

-void cxl_stop_trace_psl(struct cxl *adapter)
+void cxl_stop_trace_psl8(struct cxl *adapter)
 {
int slice;

@@ -53,7 +53,7 @@ static struct dentry *debugfs_create_io_x64(const char *name, 
umode_t mode,
  (void __force *)value, &fops_io_x64);
 }

-void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir)
+void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry *dir)
 {
debugfs_create_io_x64("fir1", S_IRUSR, dir, _cxl_p1_addr(adapter, 
CXL_PSL_FIR1));
debugfs_create_io_x64("fir2", S_IRUSR, dir, _cxl_p1_addr(adapter, 
CXL_PSL_FIR2));
@@ -9

Re: [PATCH V3 4/7] cxl: Update implementation service layer

2017-04-03 Thread Frederic Barrat



On 28/03/2017 at 17:14, Christophe Lombard wrote:

The service layer API (in cxl.h) lists some low-level functions whose
implementation is different on PSL8, PSL9 and XSL:
- Init implementation for the adapter and the afu.
- Invalidate TLB/SLB.
- Attach process for dedicated/directed models.
- Handle psl interrupts.
- Debug registers for the adapter and the afu.
- Traces.
Each environment implements its own functions, and the common code uses
them through function pointers, defined in cxl_service_layer_ops.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/cxl.h | 40 ---
 drivers/misc/cxl/debugfs.c | 16 +++---
 drivers/misc/cxl/guest.c   |  2 +-
 drivers/misc/cxl/irq.c |  2 +-
 drivers/misc/cxl/native.c  | 52 --
 drivers/misc/cxl/pci.c | 47 -
 6 files changed, 102 insertions(+), 57 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index d8282e4..65d9fb6 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -553,13 +553,23 @@ struct cxl_context {
struct mm_struct *mm;
 };

+struct cxl_irq_info;
+
 struct cxl_service_layer_ops {
int (*adapter_regs_init)(struct cxl *adapter, struct pci_dev *dev);
+   int (*invalidate_all)(struct cxl *adapter);
int (*afu_regs_init)(struct cxl_afu *afu);
+   int (*sanitise_afu_regs)(struct cxl_afu *afu);
int (*register_serr_irq)(struct cxl_afu *afu);
void (*release_serr_irq)(struct cxl_afu *afu);
-   void (*debugfs_add_adapter_sl_regs)(struct cxl *adapter, struct dentry 
*dir);
-   void (*debugfs_add_afu_sl_regs)(struct cxl_afu *afu, struct dentry 
*dir);
+   irqreturn_t (*handle_interrupt)(int irq, struct cxl_context *ctx, 
struct cxl_irq_info *irq_info);
+   irqreturn_t (*fail_irq)(struct cxl_afu *afu, struct cxl_irq_info 
*irq_info);
+   int (*activate_dedicated_process)(struct cxl_afu *afu);
+   int (*attach_afu_directed)(struct cxl_context *ctx, u64 wed, u64 amr);
+   int (*attach_dedicated_process)(struct cxl_context *ctx, u64 wed, u64 
amr);
+   void (*update_dedicated_ivtes)(struct cxl_context *ctx);
+   void (*debugfs_add_adapter_regs)(struct cxl *adapter, struct dentry 
*dir);
+   void (*debugfs_add_afu_regs)(struct cxl_afu *afu, struct dentry *dir);



Some of those callbacks should be defined for the xsl_ops structure 
(Mellanox CX4). We don't really support CX4 any more; we'll probably 
remove the code once CX5 is upstream, but for the time being we should 
keep it around.






void (*psl_irq_dump_registers)(struct cxl_context *ctx);
void (*err_irq_dump_registers)(struct cxl *adapter);
void (*debugfs_stop_trace)(struct cxl *adapter);
@@ -803,6 +813,11 @@ int afu_register_irqs(struct cxl_context *ctx, u32 count);
 void afu_release_irqs(struct cxl_context *ctx, void *cookie);
 void afu_irq_name_free(struct cxl_context *ctx);

+int cxl_attach_afu_directed_psl(struct cxl_context *ctx, u64 wed, u64 amr);
+int cxl_activate_dedicated_process_psl(struct cxl_afu *afu);
+int cxl_attach_dedicated_process_psl(struct cxl_context *ctx, u64 wed, u64 
amr);
+void cxl_update_dedicated_ivtes_psl(struct cxl_context *ctx);
+
 #ifdef CONFIG_DEBUG_FS

 int cxl_debugfs_init(void);
@@ -811,10 +826,10 @@ int cxl_debugfs_adapter_add(struct cxl *adapter);
 void cxl_debugfs_adapter_remove(struct cxl *adapter);
 int cxl_debugfs_afu_add(struct cxl_afu *afu);
 void cxl_debugfs_afu_remove(struct cxl_afu *afu);
-void cxl_stop_trace(struct cxl *cxl);
-void cxl_debugfs_add_adapter_psl_regs(struct cxl *adapter, struct dentry *dir);
-void cxl_debugfs_add_adapter_xsl_regs(struct cxl *adapter, struct dentry *dir);
-void cxl_debugfs_add_afu_psl_regs(struct cxl_afu *afu, struct dentry *dir);
+void cxl_stop_trace_psl(struct cxl *cxl);
+void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir);
+void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir);
+void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir);

 #else /* CONFIG_DEBUG_FS */

@@ -849,17 +864,17 @@ static inline void cxl_stop_trace(struct cxl *cxl)
 {
 }

-static inline void cxl_debugfs_add_adapter_psl_regs(struct cxl *adapter,
+static inline void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter,
struct dentry *dir)
 {
 }

-static inline void cxl_debugfs_add_adapter_xsl_regs(struct cxl *adapter,
+static inline void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter,
struct dentry *dir)
 {
 }

-static inline void cxl_debugfs_add_afu_psl_regs(struct cxl_afu *afu, struct 
dentry *dir)
+static inline void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct 
dentry *dir)
 {
 }

@@ -917,19 +932,20 @@ struct cxl_irq_info {
 };

 void cxl_assign_psn_space(struct cxl_context *ctx);
-

Re: [PATCH V3 3/7] cxl: Keep track of mm struct associated with a context

2017-04-03 Thread Frederic Barrat



On 28/03/2017 at 17:14, Christophe Lombard wrote:

The mm_struct corresponding to the current task is acquired each time
an interrupt is raised. So to simplify the code, we only get the
mm_struct when attaching an AFU context to the process.
The mm_count reference is increased to ensure that the mm_struct can't
be freed. The mm_struct will be released when the context is detached.
A reference on mm_users is not kept to avoid a circular dependency if
the process mmaps its cxl mmio and forgets to unmap before exiting.
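The mm_users/mm_count split that makes this safe can be modeled in userspace. Everything below except the kernel names it mimics (mm_users, mm_count, mmgrab, mmput, mmdrop) is illustrative, not kernel code:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Userspace sketch of the kernel's two-counter scheme:
 * mm_users pins the address space (page tables), mm_count
 * pins only the mm_struct itself. Holding mm_count lets the
 * process exit (mm_users -> 0) without freeing the struct. */
struct mock_mm {
    atomic_int mm_users;   /* address space users */
    atomic_int mm_count;   /* struct lifetime references */
    int teardown_done;     /* set when mm_users hits zero */
};

struct mock_mm *mock_mm_alloc(void)
{
    struct mock_mm *mm = calloc(1, sizeof(*mm));
    atomic_store(&mm->mm_users, 1);
    atomic_store(&mm->mm_count, 1);  /* mm_users holds one mm_count ref */
    return mm;
}

void mock_mmgrab(struct mock_mm *mm)     /* like mmgrab(): pin the struct */
{
    atomic_fetch_add(&mm->mm_count, 1);
}

int mock_mmdrop(struct mock_mm *mm)      /* like mmdrop(): returns 1 if freed */
{
    if (atomic_fetch_sub(&mm->mm_count, 1) == 1) {
        free(mm);
        return 1;
    }
    return 0;
}

int mock_mmput(struct mock_mm *mm)       /* like mmput(): drop an mm_users ref */
{
    if (atomic_fetch_sub(&mm->mm_users, 1) == 1) {
        mm->teardown_done = 1;           /* address space torn down... */
        return mock_mmdrop(mm);          /* ...but struct lives while mm_count > 0 */
    }
    return 0;
}
```

This is why the patch holds only mm_count: the context can outlive the process without pinning its page tables and creating the circular dependency described above.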



Commit msg should mention that we also remove glpid because it's no 
longer useful now that we track the mm_struct.


Another detail in cxl.h below



Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/api.c | 17 +--
 drivers/misc/cxl/context.c | 21 +++--
 drivers/misc/cxl/cxl.h | 13 ++--
 drivers/misc/cxl/fault.c   | 76 --
 drivers/misc/cxl/file.c| 15 +++--
 drivers/misc/cxl/main.c| 12 ++--
 6 files changed, 64 insertions(+), 90 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index bcc030e..1a138c8 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 

 #include "cxl.h"

@@ -321,19 +322,29 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,

if (task) {
ctx->pid = get_task_pid(task, PIDTYPE_PID);
-   ctx->glpid = get_task_pid(task->group_leader, PIDTYPE_PID);
kernel = false;
ctx->real_mode = false;
+
+   /* acquire a reference to the task's mm */
+   ctx->mm = get_task_mm(current);
+
+   /* ensure this mm_struct can't be freed */
+   cxl_context_mm_count_get(ctx);
+
+   /* decrement the use count */
+   if (ctx->mm)
+   mmput(ctx->mm);
}

cxl_ctx_get();

if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) {
-   put_pid(ctx->glpid);
put_pid(ctx->pid);
-   ctx->glpid = ctx->pid = NULL;
+   ctx->pid = NULL;
cxl_adapter_context_put(ctx->afu->adapter);
cxl_ctx_put();
+   if (task)
+   cxl_context_mm_count_put(ctx);
goto out;
}

diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 062bf6c..2e935ea 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -41,7 +42,7 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu 
*afu, bool master)
spin_lock_init(&ctx->sste_lock);
ctx->afu = afu;
ctx->master = master;
-   ctx->pid = ctx->glpid = NULL; /* Set in start work ioctl */
+   ctx->pid = NULL; /* Set in start work ioctl */
mutex_init(&ctx->mapping_lock);
ctx->mapping = NULL;

@@ -242,12 +243,16 @@ int __detach_context(struct cxl_context *ctx)

/* release the reference to the group leader and mm handling pid */
put_pid(ctx->pid);
-   put_pid(ctx->glpid);

cxl_ctx_put();

/* Decrease the attached context count on the adapter */
cxl_adapter_context_put(ctx->afu->adapter);
+
+   /* Decrease the mm count on the context */
+   cxl_context_mm_count_put(ctx);
+   ctx->mm = NULL;
+
return 0;
 }

@@ -325,3 +330,15 @@ void cxl_context_free(struct cxl_context *ctx)
mutex_unlock(&ctx->afu->contexts_lock);
call_rcu(&ctx->rcu, reclaim_ctx);
 }
+
+void cxl_context_mm_count_get(struct cxl_context *ctx)
+{
+   if (ctx->mm)
+   atomic_inc(&ctx->mm->mm_count);
+}
+
+void cxl_context_mm_count_put(struct cxl_context *ctx)
+{
+   if (ctx->mm)
+   mmdrop(ctx->mm);
+}
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 2bbe077..d8282e4 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -482,8 +482,6 @@ struct cxl_context {
unsigned int sst_size, sst_lru;

wait_queue_head_t wq;
-   /* pid of the group leader associated with the pid */
-   struct pid *glpid;
/* use mm context associated with this pid for ds faults */
struct pid *pid;
spinlock_t lock; /* Protects pending_irq_mask, pending_fault and 
fault_addr */
@@ -551,6 +549,8 @@ struct cxl_context {
 * CX4 only:
 */
struct list_head extra_irq_contexts;
+
+   struct mm_struct *mm;
 };

 struct cxl_service_layer_ops {
@@ -1025,4 +1025,13 @@ int cxl_adapter_context_lock(struct cxl *adapter);
 /* Unlock the contexts-lock if taken. Warn and force unlock otherwise */
 void cxl_adapter_context_unlock(struct cxl *adapter);

+/* Increases the reference count to "struct mm_struct" */
+void cxl_context_mm_count_get(struct cxl_context *ctx);
+
+/* Decreme

Re: [PATCH V3 2/7] cxl: Remove unused values in bare-metal environment.

2017-04-03 Thread Frederic Barrat



On 28/03/2017 at 17:14, Christophe Lombard wrote:

The two fields pid and tid of the structure cxl_irq_info are only used
in the guest environment. To avoid confusion, it's not necessary
to fill the fields in the bare-metal environment. These two fields
are renamed to 'reserved' to avoid undefined behavior on bare-metal.
The PSL Process and Thread Identification Register (CXL_PSL_PID_TID_An)
is only used when attaching a dedicated process for PSL8 only. This
register goes away in CAIA2.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/cxl.h| 13 +++--
 drivers/misc/cxl/hcalls.c |  4 ++--
 drivers/misc/cxl/native.c |  5 -
 3 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 79e60ec..2bbe077 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -895,19 +895,20 @@ int __detach_context(struct cxl_context *ctx);
  * plpar_hcall9() in hvCall.S
  * As a consequence:
  * - we don't need to do any endianness conversion
- * - the pid and tid are an exception. They are 32-bit values returned in
- *   the same 64-bit register. So we do need to worry about byte ordering.
+ * - the pid (reserved0) and tid (reserved1) are an exception. They are
+ *   32-bit values returned in the same 64-bit register. So we do need
+ *   to worry about byte ordering.
  */
 struct cxl_irq_info {
u64 dsisr;
u64 dar;
u64 dsr;
 #ifndef CONFIG_CPU_LITTLE_ENDIAN
-   u32 pid;
-   u32 tid;
+   u32 reserved0;
+   u32 reserved1;
 #else
-   u32 tid;
-   u32 pid;
+   u32 reserved1;
+   u32 reserved0;
 #endif



The resulting structure looks weird/over-complicated. I think we could 
get rid of the #ifdef about endianness by just declaring:


struct cxl_irq_info {
u64 dsisr;
u64 dar;
u64 dsr;
u64 pid_tid;
u64 afu_err;
...

and just print the 64-bit value in hcalls.c and let the reader figure 
out the order. It's just informational anyway.
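A minimal sketch of that suggestion: keep the hcall result as one u64 and split it arithmetically, which reads the same on big- and little-endian builds. The struct and helper names are hypothetical; only the pid-in-high-word/tid-in-low-word layout is taken from the discussion above:

```c
#include <stdint.h>

/* Endian-independent split of the single 64-bit pid/tid value
 * returned by the hypercall, instead of overlaying two u32
 * fields under #ifdef CONFIG_CPU_LITTLE_ENDIAN. */
struct irq_info_sketch {
    uint64_t pid_tid;                    /* as returned by the hcall */
};

static uint32_t irq_info_pid(const struct irq_info_sketch *info)
{
    return (uint32_t)(info->pid_tid >> 32);         /* high 32 bits */
}

static uint32_t irq_info_tid(const struct irq_info_sketch *info)
{
    return (uint32_t)(info->pid_tid & 0xffffffffu); /* low 32 bits */
}
```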



  Fred




u64 afu_err;
u64 errstat;
diff --git a/drivers/misc/cxl/hcalls.c b/drivers/misc/cxl/hcalls.c
index d6d11f4..142a095 100644
--- a/drivers/misc/cxl/hcalls.c
+++ b/drivers/misc/cxl/hcalls.c
@@ -414,8 +414,8 @@ long cxl_h_collect_int_info(u64 unit_address, u64 
process_token,
switch (rc) {
case H_SUCCESS: /* The interrupt info is returned in return 
registers. */
pr_devel("dsisr:%#llx, dar:%#llx, dsr:%#llx, pid:%u, tid:%u, 
afu_err:%#llx, errstat:%#llx\n",
-   info->dsisr, info->dar, info->dsr, info->pid,
-   info->tid, info->afu_err, info->errstat);
+   info->dsisr, info->dar, info->dsr, info->reserved0,
+   info->reserved1, info->afu_err, info->errstat);
return 0;
case H_PARAMETER:   /* An incorrect parameter was supplied. */
return -EINVAL;
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 7ae7105..7257e8b 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -859,8 +859,6 @@ static int native_detach_process(struct cxl_context *ctx)

 static int native_get_irq_info(struct cxl_afu *afu, struct cxl_irq_info *info)
 {
-   u64 pidtid;
-
/* If the adapter has gone away, we can't get any meaningful
 * information.
 */
@@ -870,9 +868,6 @@ static int native_get_irq_info(struct cxl_afu *afu, struct 
cxl_irq_info *info)
info->dsisr = cxl_p2n_read(afu, CXL_PSL_DSISR_An);
info->dar = cxl_p2n_read(afu, CXL_PSL_DAR_An);
info->dsr = cxl_p2n_read(afu, CXL_PSL_DSR_An);
-   pidtid = cxl_p2n_read(afu, CXL_PSL_PID_TID_An);
-   info->pid = pidtid >> 32;
-   info->tid = pidtid & 0x;
info->afu_err = cxl_p2n_read(afu, CXL_AFU_ERR_An);
info->errstat = cxl_p2n_read(afu, CXL_PSL_ErrStat_An);
info->proc_handle = 0;





Re: [PATCH V3 1/7] cxl: Read vsec perst load image

2017-04-03 Thread Frederic Barrat



On 28/03/2017 at 17:14, Christophe Lombard wrote:

This bit is used to cause a flash image load for a programmable
CAIA-compliant implementation. If this bit is set to ‘0’, a power
cycle of the adapter is required to load a programmable CAIA-compliant
implementation from flash.
This field will be used by the following patches.

Signed-off-by: Christophe Lombard 
---


Acked-by: Frederic Barrat 



 drivers/misc/cxl/pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 91f6459..e82a207 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1332,6 +1332,7 @@ static int cxl_read_vsec(struct cxl *adapter, struct 
pci_dev *dev)
CXL_READ_VSEC_IMAGE_STATE(dev, vsec, &image_state);
adapter->user_image_loaded = !!(image_state & 
CXL_VSEC_USER_IMAGE_LOADED);
adapter->perst_select_user = !!(image_state & 
CXL_VSEC_USER_IMAGE_LOADED);
+   adapter->perst_loads_image = !!(image_state & 
CXL_VSEC_PERST_LOADS_IMAGE);

CXL_READ_VSEC_NAFUS(dev, vsec, &adapter->slices);
CXL_READ_VSEC_AFU_DESC_OFF(dev, vsec, &afu_desc_off);





Re: [PATCH] raid6/altivec: adding vpermxor implementation for raid6 Q syndrome

2017-04-03 Thread Daniel Axtens
Hi Matt,

> The raid6 Q syndrome check has been optimised using the vpermxor
> instruction.  This instruction was made available with POWER8, ISA version
> 2.07. It allows for both vperm and vxor instructions to be done in a single
> instruction. This has been tested for correctness on a ppc64le vm with a
> basic RAID6 setup containing 5 drives.
>
> The performance benchmarks are from the raid6test in the /lib/raid6/test
> directory. These results are from an IBM Firestone machine with ppc64le
> architecture. The benchmark results show a 35% speed increase over the best
> existing algorithm for powerpc (altivec). The raid6test has also been run
> on a big-endian ppc64 vm to ensure it also works for big-endian
> architectures.
>
> Performance benchmarks:
>
>   raid6: altivecx4 gen() 18773 MB/s
>   raid6: altivecx8 gen() 19438 MB/s
>
>   raid6: vpermxor4 gen() 25112 MB/s
>   raid6: vpermxor8 gen() 26279 MB/s
>
Well done!
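For readers without the ISA at hand, the trick can be modeled in scalar C: multiplication by 2 in GF(2^8) is linear over xor, so it decomposes into two 16-entry nibble lookups — exactly the shape that vperm consumes, with vpermxor fusing the lookups and the xor into one instruction. This is a sketch assuming the RAID6 polynomial 0x11d, not the patch's actual vector code:

```c
#include <stdint.h>

/* Scalar model of what vpermxor computes for the RAID6 Q syndrome:
 * multiply a byte by 2 in GF(2^8) (poly 0x11d) via two 16-entry
 * nibble tables instead of a shift-and-reduce. */
static uint8_t gf_mul2(uint8_t x)               /* reference: shift and reduce */
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0));
}

static uint8_t gf_lo[16], gf_hi[16];            /* the two vperm lookup tables */

static void gf_tables_init(void)
{
    for (int i = 0; i < 16; i++) {
        gf_lo[i] = gf_mul2((uint8_t)i);         /* contribution of low nibble */
        gf_hi[i] = gf_mul2((uint8_t)(i << 4));  /* contribution of high nibble */
    }
}

static uint8_t gf_mul2_vpermxor(uint8_t x)      /* two lookups + xor, as vpermxor */
{
    /* mul2 is linear over xor, so splitting the byte into nibbles
     * and xoring the two table results is exact for all 256 inputs */
    return gf_lo[x & 0x0f] ^ gf_hi[x >> 4];
}
```

On POWER8 each vperm looks up 16 bytes of a block at once, which is where the benchmarked speedup over the plain altivec shift-and-mask loop comes from.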

> Note: Also fixed a small bug in pq.h regarding a missing and mismatched
> ifdef statement

So this could be slightly better explained:

4f8c55c5ad49 ("lib/raid6: build proper files on corresponding arch")
breaks raid6test on ppc by rearranging the ifdefs incorrectly.

You could possibly split that fix out into its own separate commit, but
that is probably overkill.

You should, however, probably include just before your Signed-off-by:
Fixes: 4f8c55c5ad49 ("lib/raid6: build proper files on corresponding arch")

> Signed-off-by: Matt Brown 
> ---
>  include/linux/raid/pq.h |   4 ++
>  lib/raid6/Makefile  |  27 -
>  lib/raid6/algos.c   |   4 ++
>  lib/raid6/altivec.uc|   3 ++
>  lib/raid6/test/Makefile |  28 -
>  lib/raid6/vpermxor.uc   | 102 
> 
>  6 files changed, 165 insertions(+), 3 deletions(-)
>  create mode 100644 lib/raid6/vpermxor.uc
>
> diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
> index 4d57bba..3df9aa6 100644
> --- a/include/linux/raid/pq.h
> +++ b/include/linux/raid/pq.h
> @@ -107,6 +107,10 @@ extern const struct raid6_calls raid6_avx512x2;
>  extern const struct raid6_calls raid6_avx512x4;
>  extern const struct raid6_calls raid6_tilegx8;
>  extern const struct raid6_calls raid6_s390vx8;
> +extern const struct raid6_calls raid6_vpermxor1;
> +extern const struct raid6_calls raid6_vpermxor2;
> +extern const struct raid6_calls raid6_vpermxor4;
> +extern const struct raid6_calls raid6_vpermxor8;
>  
>  struct raid6_recov_calls {
>   void (*data2)(int, size_t, int, int, void **);
> diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
> index 3057011..7775aad 100644
> --- a/lib/raid6/Makefile
> +++ b/lib/raid6/Makefile
> @@ -4,7 +4,8 @@ raid6_pq-y+= algos.o recov.o tables.o int1.o int2.o 
> int4.o \
>  int8.o int16.o int32.o
>  
>  raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o 
> avx2.o avx512.o recov_avx512.o
> -raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
> +raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
> + vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
>  raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
>  raid6_pq-$(CONFIG_TILEGX) += tilegx8.o
>  raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
> @@ -88,6 +89,30 @@ $(obj)/altivec8.c:   UNROLL := 8
>  $(obj)/altivec8.c:   $(src)/altivec.uc $(src)/unroll.awk FORCE
>   $(call if_changed,unroll)
>  
> +CFLAGS_vpermxor1.o += $(altivec_flags)
> +targets += vpermxor1.c
> +$(obj)/vpermxor1.c: UNROLL := 1
> +$(obj)/vpermxor1.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
> + $(call if_changed,unroll)
> +
> +CFLAGS_vpermxor2.o += $(altivec_flags)
> +targets += vpermxor2.c
> +$(obj)/vpermxor2.c: UNROLL := 2
> +$(obj)/vpermxor2.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
> + $(call if_changed,unroll)
> +
> +CFLAGS_vpermxor4.o += $(altivec_flags)
> +targets += vpermxor4.c
> +$(obj)/vpermxor4.c: UNROLL := 4
> +$(obj)/vpermxor4.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
> + $(call if_changed,unroll)
> +
> +CFLAGS_vpermxor8.o += $(altivec_flags)
> +targets += vpermxor8.c
> +$(obj)/vpermxor8.c: UNROLL := 8
> +$(obj)/vpermxor8.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
> + $(call if_changed,unroll)
> +
>  CFLAGS_neon1.o += $(NEON_FLAGS)
>  targets += neon1.c
>  $(obj)/neon1.c:   UNROLL := 1
> diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
> index 7857049..edd4f69 100644
> --- a/lib/raid6/algos.c
> +++ b/lib/raid6/algos.c
> @@ -74,6 +74,10 @@ const struct raid6_calls * const raid6_algos[] = {
>   &raid6_altivec2,
>   &raid6_altivec4,
>   &raid6_altivec8,
> + &raid6_vpermxor1,
> + &raid6_vpermxor2,
> + &raid6_vpermxor4,
> + &raid6_vpermxor8,
>  #endif
>  #if defined(CONFIG_TILEGX)
>   &raid6_tilegx8,
> diff --git a/lib/raid6/altivec.uc b/lib/raid6/altivec.uc
> index 682aae8..d20ed0d 100644
> --- a/lib/r

[PATCH] KVM: PPC: Book3S PR: Do not always inject facility unavailable exceptions

2017-04-03 Thread Thomas Huth
KVM should not inject a facility unavailable exception into the guest
when it tries to execute a mtspr/mfspr instruction for an SPR that
is unavailable, and the vCPU is *not* running in PRoblem state.

It's right that we inject an exception when the vCPU is in PR mode, since
chapter "6.2.10 Facility Status and Control Register" of the PowerISA
v2.07 says that "When the FSCR makes a facility unavailable, attempted
usage of the facility in *problem state* is treated as follows: [...]
Access of an SPR using mfspr/mtspr causes a Facility Unavailable
interrupt". But if the guest vCPU is not in PR mode, we should follow
the behavior that is described in chapter "4.4.4 Move To/From System
Register Instructions" instead and treat the instruction as a NOP.
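The NOP treatment hinges on recognising mtspr/mfspr in the trapped instruction. A hedged sketch of the field extraction involved, mirroring what get_op/get_xop/get_sprn do (the encoder is purely a test helper, not a kernel API): the primary opcode sits in the top 6 bits, the X-form extended opcode in bits 21:30, and the 10-bit SPR field is encoded with its two 5-bit halves swapped.

```c
#include <stdint.h>

#define PPC_OP_MTSPR 467   /* extended opcode of mtspr */
#define PPC_OP_MFSPR 339   /* extended opcode of mfspr */

static unsigned int ppc_get_op(uint32_t inst)   /* primary opcode, bits 0:5 */
{
    return inst >> 26;
}

static unsigned int ppc_get_xop(uint32_t inst)  /* X-form extended opcode */
{
    return (inst >> 1) & 0x3ff;
}

static unsigned int ppc_get_sprn(uint32_t inst) /* real SPR number */
{
    unsigned int field = (inst >> 11) & 0x3ff;   /* raw 10-bit spr field */
    return ((field & 0x1f) << 5) | (field >> 5); /* swap the halves back */
}

/* Hypothetical encoder, for testing only: mfspr rt, spr */
static uint32_t ppc_encode_mfspr(unsigned int rt, unsigned int spr)
{
    unsigned int field = ((spr & 0x1f) << 5) | (spr >> 5);
    return (31u << 26) | (rt << 21) | (field << 11) | (PPC_OP_MFSPR << 1);
}
```

With those fields in hand, the patch's check is just `op == 31 && (xop == mtspr || xop == mfspr)`, at which point the PC is advanced past the instruction instead of injecting the interrupt.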

Signed-off-by: Thomas Huth 
---
 arch/powerpc/kvm/book3s_pr.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index d4dfc0c..3e6c0b3 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -830,6 +831,31 @@ static void kvmppc_emulate_fac(struct kvm_vcpu *vcpu, 
ulong fac)
}
 }
 
+static void kvmppc_handle_fac_not_enabled(struct kvm_vcpu *vcpu, ulong fac)
+{
+   enum emulation_result er;
+   u32 inst;
+
+   if (kvmppc_get_msr(vcpu) & MSR_PR) {
+   kvmppc_trigger_fac_interrupt(vcpu, fac);
+   return;
+   }
+
+   er = kvmppc_get_last_inst(vcpu, INST_GENERIC, &inst);
+   if (er != EMULATE_DONE)
+   return;
+
+   if (get_op(inst) == 31 && (get_xop(inst) == OP_31_XOP_MTSPR ||
+  get_xop(inst) == OP_31_XOP_MFSPR)) {
+   /* mtspr and mfspr are treated as NOP for unsupported SPRs */
+   kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
+   pr_debug_ratelimited("%s: write/read of disabled SPR: %d\n",
+__func__, get_sprn(inst));
+   } else {
+   kvmppc_trigger_fac_interrupt(vcpu, fac);
+   }
+}
+
 /* Enable facilities (TAR, EBB, DSCR) for the guest */
 static int kvmppc_handle_fac(struct kvm_vcpu *vcpu, ulong fac)
 {
@@ -855,7 +881,7 @@ static int kvmppc_handle_fac(struct kvm_vcpu *vcpu, ulong 
fac)
 
if (!guest_fac_enabled) {
/* Facility not enabled by the guest */
-   kvmppc_trigger_fac_interrupt(vcpu, fac);
+   kvmppc_handle_fac_not_enabled(vcpu, fac);
return RESUME_GUEST;
}
 
-- 
1.8.3.1



[PATCH] KVM: PPC: Book3S PR: Do not fail emulation with mtspr/mfspr for unknown SPRs

2017-04-03 Thread Thomas Huth
According to the PowerISA 2.07, mtspr and mfspr should not generate
an illegal instruction exception when being used with an undefined SPR,
but rather treat the instruction as a NOP, inject a privilege exception
or an emulation assistance exception - depending on the SPR number.
Also turn the printk here into a ratelimited print statement, so that
the guest cannot flood the dmesg log of the host by issuing lots of
illegal mtspr/mfspr instructions.

Signed-off-by: Thomas Huth 
---
 arch/powerpc/kvm/book3s.c |  1 +
 arch/powerpc/kvm/book3s_emulate.c | 30 ++
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index b6b5c18..9b007f9 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -137,6 +137,7 @@ void kvmppc_inject_interrupt(struct kvm_vcpu *vcpu, int 
vec, u64 flags)
kvmppc_set_pc(vcpu, kvmppc_interrupt_offset(vcpu) + vec);
vcpu->arch.mmu.reset_msr(vcpu);
 }
+EXPORT_SYMBOL_GPL(kvmppc_inject_interrupt);
 
 static int kvmppc_book3s_vec2irqprio(unsigned int vec)
 {
diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index 8359752..9c31e23 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -503,10 +503,16 @@ int kvmppc_core_emulate_mtspr_pr(struct kvm_vcpu *vcpu, 
int sprn, ulong spr_val)
break;
 unprivileged:
default:
-   printk(KERN_INFO "KVM: invalid SPR write: %d\n", sprn);
-#ifndef DEBUG_SPR
-   emulated = EMULATE_FAIL;
-#endif
+   pr_info_ratelimited("KVM: invalid SPR write: %d\n", sprn);
+   if (sprn & 0x10) {
+   if (kvmppc_get_msr(vcpu) & MSR_PR)
+   kvmppc_core_queue_program(vcpu, SRR1_PROGPRIV);
+   } else {
+   if ((kvmppc_get_msr(vcpu) & MSR_PR) || sprn == 0)
+   kvmppc_inject_interrupt(vcpu,
+   BOOK3S_INTERRUPT_H_EMUL_ASSIST,
+   0);
+   }
break;
}
 
@@ -648,10 +654,18 @@ int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu, 
int sprn, ulong *spr_val
break;
default:
 unprivileged:
-   printk(KERN_INFO "KVM: invalid SPR read: %d\n", sprn);
-#ifndef DEBUG_SPR
-   emulated = EMULATE_FAIL;
-#endif
+   pr_info_ratelimited("KVM: invalid SPR read: %d\n", sprn);
+   if (sprn & 0x10) {
+   if (kvmppc_get_msr(vcpu) & MSR_PR)
+   kvmppc_core_queue_program(vcpu, SRR1_PROGPRIV);
+   } else {
+   if ((kvmppc_get_msr(vcpu) & MSR_PR) || sprn == 0 ||
+   sprn == 4 || sprn == 5 || sprn == 6)
+   kvmppc_inject_interrupt(vcpu,
+   BOOK3S_INTERRUPT_H_EMUL_ASSIST,
+   0);
+   }
+
break;
}
 
-- 
1.8.3.1



Re: [v6, 01/19] powerpc/mm/slice: Convert slice_mask high slice to a bitmap

2017-04-03 Thread Michael Ellerman
On Thu, 2017-03-30 at 12:03:49 UTC, Michael Ellerman wrote:
> From: "Aneesh Kumar K.V" 
> 
> In a followup patch we want to increase the VA range, which will result
> in high_slices requiring more than 64 bits. To enable this, convert
> high_slices to a bitmap. We keep the number of bits the same in this
> patch and later change it to a higher value.
> 
> Signed-off-by: Aneesh Kumar K.V 
> [mpe: Fold in fix to use bitmap_empty()]
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/f3207c124e7aa8d4d9cf32cc45b10c

cheers


Re: [V3,01/10] powerpc/mm/nohash: MM_SLICE is only used by book3s 64

2017-04-03 Thread Michael Ellerman
On Tue, 2017-03-21 at 17:29:51 UTC, "Aneesh Kumar K.V" wrote:
> The BOOKE code is dead code as per the Kconfig details, so make things
> simpler by enabling MM_SLICE only for book3s_64. The changes w.r.t. nohash
> just remove dead code. W.r.t. ppc64, 4k without hugetlb will now enable
> MM_SLICE, but that is good, because we drop one extra variant which probably
> is not getting tested much.
> 
> Reviewed-by: Paul Mackerras 
> Signed-off-by: Aneesh Kumar K.V 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/b42279f0165cdfb093718ec368e56d

cheers


Re: [v2, 1/2] powerpc/fadump: reserve memory at an offset closer to bottom of RAM

2017-04-03 Thread Michael Ellerman
On Thu, 2017-03-16 at 21:05:26 UTC, Hari Bathini wrote:
> Currently, the area to preserve boot memory is reserved at the top of
> RAM. This leaves fadump vulnerable to memory hot-remove operations. As
> memory for fadump has to be reserved early in the boot process, fadump
> can't be registered after a memory hot-remove operation. Though this
> problem can't be eliminated completely, the impact can be minimized by
> reserving memory at an offset closer to the bottom of RAM. The offset
> for fadump memory reservation can be any value greater than fadump boot
> memory size.
> 
> Signed-off-by: Hari Bathini 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/f6e6bedb773118713a8eb3736a4c99

cheers


Re: powerpc/4xx: make sam440ep_setup_rtc init

2017-04-03 Thread Michael Ellerman
On Tue, 2016-04-26 at 16:49:38 UTC, yshi wrote:
> sam440ep_setup_rtc is only called by machine_device_initcall, and it calls
> i2c_register_board_info, which is __init too, so the lack of __init may cause
> a section mismatch warning when linking the kernel.
> 
> Signed-off-by: Yang Shi 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/adec9a2e7ca3b4193818f9a7b39563

cheers


[PATCH v2 3/3] powerpc/powernv: Introduce address translation services for Nvlink2

2017-04-03 Thread Alistair Popple
Nvlink2 supports address translation services (ATS), allowing devices
to request address translations from an MMU known as the nest MMU,
which is set up to walk the CPU page tables.

To access this functionality certain firmware calls are required to
setup and manage hardware context tables in the nvlink processing unit
(NPU). The NPU also manages forwarding of TLB invalidates (known as
address translation shootdowns/ATSDs) to attached devices.

This patch exports several methods to allow device drivers to register
a process id (PASID/PID) in the hardware tables and to receive
notification of when a device should stop issuing address translation
requests (ATRs). It also adds a fault handler to allow device drivers
to demand fault pages in.

Signed-off-by: Alistair Popple 
---

Changes since v1:

 - Moved exported function call prototypes to an externally accessable
   header.
 - Fixed invalidation of the NMMU PWC (this also fixes the
   reported build failure).

 arch/powerpc/include/asm/book3s/64/mmu.h   |   6 +
 arch/powerpc/include/asm/opal-api.h|   5 +-
 arch/powerpc/include/asm/opal.h|   5 +
 arch/powerpc/include/asm/powernv.h |  22 ++
 arch/powerpc/mm/mmu_context_book3s64.c |   1 +
 arch/powerpc/platforms/powernv/npu-dma.c   | 443 +
 arch/powerpc/platforms/powernv/opal-wrappers.S |   3 +
 arch/powerpc/platforms/powernv/pci-ioda.c  |   2 +
 arch/powerpc/platforms/powernv/pci.h   |  15 +-
 9 files changed, 500 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 805d4105..1676ec8 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -73,10 +73,16 @@ extern struct patb_entry *partition_tb;
 typedef unsigned long mm_context_id_t;
 struct spinlock;

+/* Maximum possible number of NPUs in a system. */
+#define NV_MAX_NPUS 8
+
 typedef struct {
mm_context_id_t id;
u16 user_psize; /* page size index */

+   /* NPU NMMU context */
+   struct npu_context *npu_context;
+
 #ifdef CONFIG_PPC_MM_SLICES
u64 low_slices_psize;   /* SLB page size encodings */
unsigned char high_slices_psize[SLICE_ARRAY_SIZE];
diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index a0aa285..a599a2c 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -168,7 +168,10 @@
 #define OPAL_INT_SET_MFRR  125
 #define OPAL_PCI_TCE_KILL  126
 #define OPAL_NMMU_SET_PTCR 127
-#define OPAL_LAST  127
+#define OPAL_NPU_INIT_CONTEXT  146
+#define OPAL_NPU_DESTROY_CONTEXT   147
+#define OPAL_NPU_MAP_LPAR  148
+#define OPAL_LAST  148

 /* Device tree flags */

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 1ff03a6..b3b97c4 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -29,6 +29,11 @@ extern struct device_node *opal_node;

 /* API functions */
 int64_t opal_invalid_call(void);
+int64_t opal_npu_destroy_context(uint64_t phb_id, uint64_t pid, uint64_t bdf);
+int64_t opal_npu_init_context(uint64_t phb_id, int pasid, uint64_t msr,
+   uint64_t bdf);
+int64_t opal_npu_map_lpar(uint64_t phb_id, uint64_t bdf, uint64_t lparid,
+   uint64_t lpcr);
 int64_t opal_console_write(int64_t term_number, __be64 *length,
   const uint8_t *buffer);
 int64_t opal_console_read(int64_t term_number, __be64 *length,
diff --git a/arch/powerpc/include/asm/powernv.h 
b/arch/powerpc/include/asm/powernv.h
index 0e9c240..f627977 100644
--- a/arch/powerpc/include/asm/powernv.h
+++ b/arch/powerpc/include/asm/powernv.h
@@ -11,9 +11,31 @@
 #define _ASM_POWERNV_H

 #ifdef CONFIG_PPC_POWERNV
+#define NPU2_WRITE 1
 extern void powernv_set_nmmu_ptcr(unsigned long ptcr);
+extern struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
+   unsigned long flags,
+   struct npu_context *(*cb)(struct npu_context *, void *),
+   void *priv);
+extern void pnv_npu2_destroy_context(struct npu_context *context,
+   struct pci_dev *gpdev);
+extern int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea,
+   unsigned long *flags, unsigned long *status,
+   int count);
 #else
 static inline void powernv_set_nmmu_ptcr(unsigned long ptcr) { }
+static inline struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
+   unsigned long flags,
+   struct npu_context *(*cb)(struct npu_context *, void *),
+   void *priv) { return ERR_PTR(-ENODEV);

[PATCH v2 2/3] powerpc/powernv: Add sanity checks to pnv_pci_get_{gpu|npu}_dev

2017-04-03 Thread Alistair Popple
The pnv_pci_get_{gpu|npu}_dev functions are used to find associations
between nvlink PCIe devices and standard PCIe devices. However, they
lack basic sanity checking, which results in NULL pointer dereferences
if they are called incorrectly; such crashes can be harder to spot than
an explicit WARN_ON.

Signed-off-by: Alistair Popple 
---
 arch/powerpc/platforms/powernv/npu-dma.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 1c383f3..050bd5d 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -37,6 +37,12 @@ struct pci_dev *pnv_pci_get_gpu_dev(struct pci_dev *npdev)
struct device_node *dn;
struct pci_dev *gpdev;
 
+   if (WARN_ON(!npdev))
+   return NULL;
+
+   if (WARN_ON(!npdev->dev.of_node))
+   return NULL;
+
/* Get assoicated PCI device */
dn = of_parse_phandle(npdev->dev.of_node, "ibm,gpu", 0);
if (!dn)
@@ -55,6 +61,12 @@ struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, 
int index)
struct device_node *dn;
struct pci_dev *npdev;
 
+   if (WARN_ON(!gpdev))
+   return NULL;
+
+   if (WARN_ON(!gpdev->dev.of_node))
+   return NULL;
+
/* Get assoicated PCI device */
dn = of_parse_phandle(gpdev->dev.of_node, "ibm,npu", index);
if (!dn)
-- 
2.1.4



[PATCH v2 1/3] drivers/of/base.c: Add of_property_read_u64_index

2017-04-03 Thread Alistair Popple
There is of_property_read_u32_index but no u64 variant. This patch
adds one similar to the u32 version for u64.

Signed-off-by: Alistair Popple 
Acked-by: Rob Herring 
---
 drivers/of/base.c  | 31 +++
 include/linux/of.h |  3 +++
 2 files changed, 34 insertions(+)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index d7c4629..0ea16bd 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -1213,6 +1213,37 @@ int of_property_read_u32_index(const struct device_node 
*np,
 EXPORT_SYMBOL_GPL(of_property_read_u32_index);
 
 /**
+ * of_property_read_u64_index - Find and read a u64 from a multi-value 
property.
+ *
+ * @np:device node from which the property value is to be read.
+ * @propname:  name of the property to be searched.
+ * @index: index of the u64 in the list of values
+ * @out_value: pointer to return value, modified only if no error.
+ *
+ * Search for a property in a device node and read nth 64-bit value from
+ * it. Returns 0 on success, -EINVAL if the property does not exist,
+ * -ENODATA if property does not have a value, and -EOVERFLOW if the
+ * property data isn't large enough.
+ *
+ * The out_value is modified only if a valid u64 value can be decoded.
+ */
+int of_property_read_u64_index(const struct device_node *np,
+  const char *propname,
+  u32 index, u64 *out_value)
+{
+   const u64 *val = of_find_property_value_of_size(np, propname,
+   ((index + 1) * sizeof(*out_value)),
+   0, NULL);
+
+   if (IS_ERR(val))
+   return PTR_ERR(val);
+
+   *out_value = be64_to_cpup(((__be64 *)val) + index);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(of_property_read_u64_index);
+
+/**
  * of_property_read_variable_u8_array - Find and read an array of u8 from a
  * property, with bounds on the minimum and maximum array size.
  *
diff --git a/include/linux/of.h b/include/linux/of.h
index 21e6323..d08788d 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -292,6 +292,9 @@ extern int of_property_count_elems_of_size(const struct 
device_node *np,
 extern int of_property_read_u32_index(const struct device_node *np,
   const char *propname,
   u32 index, u32 *out_value);
+extern int of_property_read_u64_index(const struct device_node *np,
+  const char *propname,
+  u32 index, u64 *out_value);
 extern int of_property_read_variable_u8_array(const struct device_node *np,
const char *propname, u8 *out_values,
size_t sz_min, size_t sz_max);
-- 
2.1.4



Re: [PowerPC][next-20170324][kselftest] kernel Oops when running tm/tm-signal-context-chk-vsx

2017-04-03 Thread Abdul Haleem
On Mon, 2017-04-03 at 14:28 +0530, Abdul Haleem wrote:
> On Tue, 2017-03-28 at 21:00 +1100, Michael Ellerman wrote:
> > Abdul Haleem  writes:
> > 
> > > Hi,
> > >
> > > While running kernel self tests on ppc64, tm/tm-signal-context-chk-vsx
> > > tests fails with Oops message. 
> > >
> > > I was able to reproduce only twice out of 20 runs on next-20170324 only.
> > > so it is difficult to bisect the commit causing the issue.
> > 
> > Can you try mainline as of this commit:
> > 
> > 605df8d674ac ("selftests/powerpc: Replace stxvx and lxvx with 
> > stxvd2x/lxvd2x")
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=605df8d674ac65e044a0bf4998b28c2f350b7f9e
> > 
> > cheers
> > 
> 
> 
> Trace is not reproducible on mainline with above commit.
> 
> Cyril was able to reproduce it and is working on it.
> 
> 
> 
> 
cc Cyril Bur 

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





Re: [PowerPC][next-20170324][kselftest] kernel Oops when running tm/tm-signal-context-chk-vsx

2017-04-03 Thread Abdul Haleem
On Tue, 2017-03-28 at 21:00 +1100, Michael Ellerman wrote:
> Abdul Haleem  writes:
> 
> > Hi,
> >
> > While running kernel self tests on ppc64, tm/tm-signal-context-chk-vsx
> > tests fails with Oops message. 
> >
> > I was able to reproduce only twice out of 20 runs on next-20170324 only.
> > so it is difficult to bisect the commit causing the issue.
> 
> Can you try mainline as of this commit:
> 
> 605df8d674ac ("selftests/powerpc: Replace stxvx and lxvx with stxvd2x/lxvd2x")
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=605df8d674ac65e044a0bf4998b28c2f350b7f9e
> 
> cheers
> 


Trace is not reproducible on mainline with above commit.

Cyril was able to reproduce it and is working on it.




-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





[PATCH] powerpc/mm: remove stale comment

2017-04-03 Thread Oliver O'Halloran
The code to fix the problem it describes was removed in c40785a and it
uses the stupid comment style. Away it goes!

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/mm/hash_utils_64.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 8848fec..69a05b3 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -927,11 +927,6 @@ static void __init htab_initialize(void)
}
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
-   /* On U3 based machines, we need to reserve the DART area and
-* _NOT_ map it to avoid cache paradoxes as it's remapped non
-* cacheable later on
-*/
-
/* create bolted the linear mapping in the hash table */
for_each_memblock(memory, reg) {
base = (unsigned long)__va(reg->base);
-- 
2.9.3