Re: [PATCH] KVM: PPC: Book3S PR: Do not fail emulation with mtspr/mfspr for unknown SPRs
On Mon, Apr 03, 2017 at 01:23:15PM +0200, Thomas Huth wrote:
> According to the PowerISA 2.07, mtspr and mfspr should not generate
> an illegal instruction exception when being used with an undefined SPR,
> but rather treat the instruction as a NOP, inject a privilege exception
> or an emulation assistance exception - depending on the SPR number.

The emulation assist interrupt is a hypervisor interrupt, so the guest
would not be expecting to receive it. On a real machine, the hypervisor
would synthesize an illegal instruction type program interrupt, as
described in the last programming note in section 6.5.9 of Book III of
Power ISA v2.07B. Since we are the hypervisor here, we should synthesize
a program interrupt rather than an emulation assist interrupt.

> Also turn the printk here into a ratelimited print statement, so that
> the guest cannot flood the dmesg log of the host by issuing lots of
> illegal mtspr/mfspr instructions here.

Good idea.

Paul.
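To make the interrupt choice concrete, here is a small user-space sketch of the rule Paul describes. The function name, the enum, and the privileged-SPR test (`sprn & 0x10`) are illustrative assumptions for this note, not the actual KVM emulation code:

```c
#include <assert.h>

/* Hypothetical outcomes for an mfspr/mtspr access to an undefined SPR. */
enum spr_fault { INJECT_PRIV_PROGRAM, INJECT_ILL_PROGRAM };

/*
 * By ISA convention, bit 0x10 of the SPR number marks a privileged SPR.
 * An undefined privileged SPR accessed in problem state gets a
 * privileged-instruction program interrupt; otherwise we (standing in
 * for the hypervisor) synthesize an illegal-instruction program
 * interrupt instead of an emulation assist interrupt.
 */
static enum spr_fault classify_undefined_spr(unsigned int sprn, int msr_pr)
{
	if ((sprn & 0x10) && msr_pr)
		return INJECT_PRIV_PROGRAM;
	return INJECT_ILL_PROGRAM;
}
```

The sketch only captures the interrupt selection, not the delivery machinery.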
Re: [PATCH] KVM: PPC: Book3S PR: Do not always inject facility unavailable exceptions
On Mon, Apr 03, 2017 at 01:28:34PM +0200, Thomas Huth wrote:
> KVM should not inject a facility unavailable exception into the guest
> when it tries to execute a mtspr/mfspr instruction for an SPR that
> is unavailable, and the vCPU is *not* running in PRoblem state.
>
> It's right that we inject an exception when the vCPU is in PR mode, since
> chapter "6.2.10 Facility Status and Control Register" of the PowerISA
> v2.07 says that "When the FSCR makes a facility unavailable, attempted
> usage of the facility in *problem state* is treated as follows: [...]
> Access of an SPR using mfspr/mtspr causes a Facility Unavailable
> interrupt". But if the guest vCPU is not in PR mode, we should follow
> the behavior that is described in chapter "4.4.4 Move To/From System
> Register Instructions" instead and treat the instruction as a NOP.

This doesn't seem quite right. My reading of the ISA is that the FSCR
bit for a facility being 0 doesn't prevent privileged code from
accessing the facility, so we shouldn't be treating mfspr/mtspr as a
NOP. Instead we should set the facility's bit in the shadow FSCR and
re-execute the instruction (remembering of course to clear the FSCR bit
when we go back to emulated problem state).

For TM it's a bit different, as the MSR[TM] bit does prevent privileged
code from accessing TM registers and instructions, so for TM we should
be delivering a facility unavailable interrupt even when the guest is in
emulated privileged state.

So I don't see any case where mfspr/mtspr should be treated as a NOP in
response to a facility unavailable interrupt.

Paul.
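A minimal sketch of the policy described above, with hypothetical names (`on_fac_unavail`, `FAC_TM`) rather than the real KVM interfaces:

```c
#include <assert.h>

/* Hypothetical actions on a facility unavailable interrupt from the guest. */
enum fscr_action { SET_SHADOW_FSCR_AND_RETRY, DELIVER_FAC_UNAVAIL };

#define FAC_TM	1	/* hypothetical facility id, for illustration only */

/*
 * Sketch of Paul's reading: for most facilities, privileged guest code
 * is not blocked by FSCR, so enable the bit in the shadow FSCR and
 * re-execute the instruction.  In problem state the interrupt is
 * delivered to the guest, and TM is special because MSR[TM] gates
 * privileged code too.
 */
static enum fscr_action on_fac_unavail(int facility, int guest_pr)
{
	if (guest_pr || facility == FAC_TM)
		return DELIVER_FAC_UNAVAIL;
	return SET_SHADOW_FSCR_AND_RETRY;
}
```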
[PATCH] powerpc/hugetlb: Add ABI defines for MAP_HUGE_16MB and MAP_HUGE_16GB
This just adds user space exported ABI definitions for both 16MB and
16GB non-default huge page sizes to be used with the mmap() system call.

Signed-off-by: Anshuman Khandual
---
These defined values will be used along with MAP_HUGETLB while calling
the mmap() system call if the desired HugeTLB page size is not the
default one. Follows similar definitions present in x86:

arch/x86/include/uapi/asm/mman.h:#define MAP_HUGE_2MB	(21 << MAP_HUGE_SHIFT)
arch/x86/include/uapi/asm/mman.h:#define MAP_HUGE_1GB	(30 << MAP_HUGE_SHIFT)

 arch/powerpc/include/uapi/asm/mman.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 03c06ba..e78980b 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -29,4 +29,7 @@
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
 
+#define MAP_HUGE_16MB	(24 << MAP_HUGE_SHIFT)	/* 16MB HugeTLB Page */
+#define MAP_HUGE_16GB	(34 << MAP_HUGE_SHIFT)	/* 16GB HugeTLB Page */
+
 #endif /* _UAPI_ASM_POWERPC_MMAN_H */
-- 
1.8.5.2
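For reference, a user-space sketch of how the new flag would be used with mmap(). `MAP_HUGE_SHIFT` (26) and the fallback defines below are assumptions for systems whose headers do not yet carry them; the mapping attempt is expected to fail unless 16MB huge pages are actually configured:

```c
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT	26
#endif
#ifndef MAP_HUGETLB
#define MAP_HUGETLB	0x40000
#endif
#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS	0x20
#endif
/* Encodes log2 of the page size: log2(16MB) == 24 */
#define MAP_HUGE_16MB	(24 << MAP_HUGE_SHIFT)

/*
 * Attempt a 16MB-huge-page-backed anonymous mapping; returns NULL if
 * the system has no 16MB huge pages configured (the common case off
 * powerpc), rather than treating that as fatal.
 */
static void *try_map_16mb(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_16MB,
		       -1, 0);
	return p == MAP_FAILED ? NULL : p;
}
```

Note that 34 << 26 (MAP_HUGE_16GB) overflows a signed 32-bit int, so user code should compute it in an unsigned type.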
Re: [7/7] crypto: caam/qi - add ablkcipher and authenc algorithms
Horia Geantă writes:
> Add support to submit ablkcipher and authenc algorithms
> via the QI backend:
> -ablkcipher:
>  cbc({aes,des,des3_ede})
>  ctr(aes), rfc3686(ctr(aes))
>  xts(aes)
> -authenc:
>  authenc(hmac(md5),cbc({aes,des,des3_ede}))
>  authenc(hmac(sha*),cbc({aes,des,des3_ede}))
>
> caam/qi being a new driver, let's wait some time to settle down without
> interfering with existing caam/jr driver.
> Accordingly, for now all caam/qi algorithms (caamalg_qi module) are
> marked to be of lower priority than caam/jr ones (caamalg module).
>
> Signed-off-by: Vakul Garg
> Signed-off-by: Alex Porosanu
> Signed-off-by: Horia Geantă
> ---
>  drivers/crypto/caam/Kconfig        |   20 +-
>  drivers/crypto/caam/Makefile       |    1 +
>  drivers/crypto/caam/caamalg.c      |    9 +-
>  drivers/crypto/caam/caamalg_desc.c |   77 +-
>  drivers/crypto/caam/caamalg_desc.h |   15 +-
>  drivers/crypto/caam/caamalg_qi.c   | 2387
>  drivers/crypto/caam/sg_sw_qm.h     |  108 ++
>  7 files changed, 2601 insertions(+), 16 deletions(-)
>  create mode 100644 drivers/crypto/caam/caamalg_qi.c
>  create mode 100644 drivers/crypto/caam/sg_sw_qm.h

This appears to be blowing up my Freescale (NXP) P5020DS board:

Unable to handle kernel paging request for data at address 0x0020
Faulting instruction address: 0xc04393e4
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=24 CoreNet Generic
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc3-compiler_gcc-4.6.3-00046-gb189817cf789 #5
task: c000f70c task.stack: c000f70c8000
NIP: c04393e4 LR: c04aeba0 CTR: c04fa7d8
REGS: c000f70cb160 TRAP: 0300  Not tainted (4.11.0-rc3-compiler_gcc-4.6.3-00046-gb189817cf789)
MSR: 80029000 CR: 24adbe48 XER: 2000
DEAR: 0020 ESR: SOFTE: 1
GPR00: c06feba0 c000f70cb3e0 c0e6
GPR04: 0001 c0e0b290 0003
GPR08: 0004 c0ea5280 0004 0004
GPR12: 88adbe22 c0003fff5000 c0ba3518 880088090fa8
GPR16: 1000 c0ba3500 c000f72c68d8 0004
GPR20: c0ea5280 c0ba34e8 0020 0004
GPR24: c0eab7c0 c000f7fc8818 c0eb
GPR28: c000f786cc00 c0eab780 f786cc00 c0eab7c0
NIP [c04393e4] .gen_pool_alloc+0x0/0xc
LR [c04aeba0] .qman_alloc_cgrid_range+0x24/0x54
Call Trace:
[c000f70cb3e0] [c0504054] .platform_device_register_full+0x12c/0x150 (unreliable)
[c000f70cb460] [c06feba0] .caam_qi_init+0x158/0x63c
[c000f70cb5f0] [c06fc64c] .caam_probe+0x8b8/0x1830
[c000f70cb740] [c0503288] .platform_drv_probe+0x60/0xac
[c000f70cb7c0] [c0501194] .driver_probe_device+0x248/0x344
[c000f70cb870] [c05013b4] .__driver_attach+0x124/0x128
[c000f70cb900] [c04fed90] .bus_for_each_dev+0x80/0xcc
[c000f70cb9a0] [c0500858] .driver_attach+0x24/0x38
[c000f70cba10] [c050043c] .bus_add_driver+0x14c/0x29c
[c000f70cbab0] [c0501d64] .driver_register+0x8c/0x154
[c000f70cbb30] [c0503000] .__platform_driver_register+0x48/0x5c
[c000f70cbba0] [c0c7f798] .caam_driver_init+0x1c/0x30
[c000f70cbc10] [c0001904] .do_one_initcall+0x60/0x1a8
[c000f70cbcf0] [c0c35f30] .kernel_init_freeable+0x248/0x334
[c000f70cbdb0] [c00020fc] .kernel_init+0x1c/0xf20
[c000f70cbe30] [c9bc] .ret_from_kernel_thread+0x58/0x9c
Instruction dump:
eb61ffd8 eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3860 4bb0
7ce53b78 4b0c 7f67db78 4b24 e8c30028 4bfffd30 fbe1fff8
---[ end trace 9f61087068991b02 ]---

home:linux-next(4)(I)> git bisect log
...
git bisect bad b189817cf7894e03fd3700acd923221d3007259e
# first bad commit: [b189817cf7894e03fd3700acd923221d3007259e] crypto: caam/qi - add ablkcipher and authenc algorithms

The oops is saying gen_pool_alloc() was called with a NULL pointer, so
it seems qm_cgralloc is NULL:

static int qman_alloc_range(struct gen_pool *p, u32 *result, u32 cnt)
{
	unsigned long addr;

	addr = gen_pool_alloc(p, cnt);
	...

int qman_alloc_cgrid_range(u32 *result, u32 count)
{
	return qman_alloc_range(qm_cgralloc, result, count);
}

I didn't pull the thread any further than that.

cheers
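For illustration only, here is a stubbed user-space sketch of the failure mode and one possible defensive guard. The stubbed types and the NULL check are assumptions; the real fix is presumably making caam/qi not probe (or fail gracefully) when the qman CGR ID pool was never set up, not merely adding a NULL check:

```c
#include <stddef.h>

/* Stand-ins for the kernel types, just to illustrate the failure mode. */
typedef unsigned int u32;
struct gen_pool { unsigned long next; };

static unsigned long gen_pool_alloc(struct gen_pool *p, size_t cnt)
{
	(void)cnt;
	return p->next;		/* dereferences p: a NULL p is the oops above */
}

/*
 * Hypothetical guard: if the CGR ID pool was never initialized (e.g.
 * the qman driver did not probe on this board), fail the allocation
 * instead of oopsing.
 */
static int qman_alloc_range(struct gen_pool *p, u32 *result, u32 cnt)
{
	unsigned long addr;

	if (!p)
		return -19;	/* -ENODEV: pool never set up */
	addr = gen_pool_alloc(p, cnt);
	if (!addr)
		return -12;	/* -ENOMEM */
	*result = (u32)addr;
	return 0;
}
```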
[PATCH] powerpc: Don't try to fix up misaligned load-with-reservation instructions
In the past, there was only one load-with-reservation instruction,
lwarx, and if a program attempted a lwarx on a misaligned address, it
would take an alignment interrupt and the kernel handler would emulate
it as though it was lwzx, which was not really correct, but benign since
it is loading the right amount of data, and the lwarx should be paired
with a stwcx. to the same address, which would also cause an alignment
interrupt which would result in a SIGBUS being delivered to the process.

We now have 5 different sizes of load-with-reservation instruction. Of
those, lharx and ldarx cause an immediate SIGBUS by luck since their
entries in aligninfo[] overlap instructions which were not fixed up, but
lqarx overlaps with lhz and will be emulated as such. lbarx can never
generate an alignment interrupt since it only operates on 1 byte.

To straighten this out and fix the lqarx case, this adds code to detect
the l[hwdq]arx instructions and return without fixing them up, resulting
in a SIGBUS being delivered to the process.

Signed-off-by: Paul Mackerras
---
 arch/powerpc/kernel/align.c | 27 +++
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index cbc7c42cdb74..ec7a8b099dd9 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -807,14 +807,25 @@ int fix_alignment(struct pt_regs *regs)
 	nb = aligninfo[instr].len;
 	flags = aligninfo[instr].flags;
 
-	/* ldbrx/stdbrx overlap lfs/stfs in the DSISR unfortunately */
-	if (IS_XFORM(instruction) && ((instruction >> 1) & 0x3ff) == 532) {
-		nb = 8;
-		flags = LD+SW;
-	} else if (IS_XFORM(instruction) &&
-		   ((instruction >> 1) & 0x3ff) == 660) {
-		nb = 8;
-		flags = ST+SW;
+	/*
+	 * Handle some cases which give overlaps in the DSISR values.
+	 */
+	if (IS_XFORM(instruction)) {
+		switch (get_xop(instruction)) {
+		case 532:	/* ldbrx */
+			nb = 8;
+			flags = LD+SW;
+			break;
+		case 660:	/* stdbrx */
+			nb = 8;
+			flags = ST+SW;
+			break;
+		case 20:	/* lwarx */
+		case 84:	/* ldarx */
+		case 116:	/* lharx */
+		case 276:	/* lqarx */
+			return 0;	/* not emulated ever */
+		}
 	}
 
 	/* Byteswap little endian loads and stores */
-- 
2.11.0
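The extended-opcode extraction the patch relies on can be sketched and checked in isolation; `get_xop()` here mirrors the kernel helper of the same name, under the assumption that the instruction is X-form (primary opcode 31):

```c
#include <assert.h>

/*
 * X-form instructions carry the extended opcode in bits 21-30 of the
 * 32-bit instruction word, i.e. (instr >> 1) & 0x3ff.  That is how the
 * handler distinguishes ldbrx (532), stdbrx (660) and the
 * load-with-reservation instructions lwarx (20), ldarx (84),
 * lharx (116) and lqarx (276).
 */
static unsigned int get_xop(unsigned int instr)
{
	return (instr >> 1) & 0x3ff;
}
```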
Re: [PATCH v6 06/11] powerpc/perf: IMC pmu cpumask and cpu hotplug support
Madhavan Srinivasan writes:
> From: Hemant Kumar
>
> Adds cpumask attribute to be used by each IMC pmu. Only one cpu (any
> online CPU) from each chip for nest PMUs is designated to read counters.
>
> On CPU hotplug, dying CPU is checked to see whether it is one of the
> designated cpus, if yes, next online cpu from the same chip (for nest
> units) is designated as new cpu to read counters. For this purpose, we
> introduce a new state : CPUHP_AP_PERF_POWERPC_NEST_ONLINE.
>
> Signed-off-by: Anju T Sudhakar
> Signed-off-by: Hemant Kumar
> Signed-off-by: Madhavan Srinivasan
> ---
>  arch/powerpc/include/asm/opal-api.h            |  13 ++-
>  arch/powerpc/include/asm/opal.h                |   3 +
>  arch/powerpc/perf/imc-pmu.c                    | 155 -
>  arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
>  include/linux/cpuhotplug.h                     |   1 +
>  5 files changed, 171 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
> index a0aa285869b5..23fc51e9d71d 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -168,7 +168,8 @@
>  #define OPAL_INT_SET_MFRR			125
>  #define OPAL_PCI_TCE_KILL			126
>  #define OPAL_NMMU_SET_PTCR			127
> -#define OPAL_LAST				127
> +#define OPAL_NEST_IMC_COUNTERS_CONTROL		149

A couple of things:

- why is this 149 rather than 128?

- CONTROL seems like it's inviting a very broad and under-specified
  API. I notice most of the other opal calls are reasonably narrow:
  often defining two calls for get/set rather than a single control
  call. I don't have cycles to review the OPAL/skiboot side and I'm
  very much open to being convinced here, I just want to avoid the
  situation where we're passing around complicated data structures in
  a way that is difficult to synchronise if we could avoid it.
> +#define OPAL_LAST				149
>
>  /* Device tree flags */
>
> @@ -928,6 +929,16 @@ enum {
>  	OPAL_PCI_TCE_KILL_ALL,
>  };
>
> +/* Operation argument to OPAL_NEST_IMC_COUNTERS_CONTROL */
> +enum {
> +	OPAL_NEST_IMC_PRODUCTION_MODE = 1,
> +};

If there's only one mode, why do we need to specify it? As far as I can
tell no extra mode is added later in the series...

... looking at the opal-side patches, would it be better to just specify
the modes you intend to support in future here, and rely on opal
returning OPAL_PARAMETER when they're not supported?

> +
> +enum {
> +	OPAL_NEST_IMC_STOP,
> +	OPAL_NEST_IMC_START,
> +};

Again, would it be better to have a stop/start call rather than a
generic control call?

Also, the enum values (STOP=0, START=1) seem to be switched vs OPAL,
where https://patchwork.ozlabs.org/patch/746292/ has STOP=1, START=0.
Or am I missing something?

> +
>  #endif /* __ASSEMBLY__ */
>
>  #endif /* __OPAL_API_H */
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 1ff03a6da76e..d93d08204243 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -227,6 +227,9 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t kill_type,
>  			  uint64_t dma_addr, uint32_t npages);
>  int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);
>
> +int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
> +				       uint64_t value2, uint64_t value3);
> +

This prototype does not make me feel better about the state of the API.
Looking at the opal side, I would be much more comfortable if you could
nail down what you intend to have these support now, even if OPAL bails
with OPAL_PARAMETER if they're specified.

Also I think u64 and u32 are preferred, although I see the rest of the
file uses the long form.
> /* Internal functions */ > extern int early_init_dt_scan_opal(unsigned long node, const char *uname, > int depth, void *data); > diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c > index bd5bf66bd920..b28835b861b3 100644 > --- a/arch/powerpc/perf/imc-pmu.c > +++ b/arch/powerpc/perf/imc-pmu.c > @@ -17,6 +17,7 @@ > > struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; > struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; > +static cpumask_t nest_imc_cpumask; > > /* Needed for sanity check */ > extern u64 nest_max_offset; > @@ -32,6 +33,152 @@ static struct attribute_group imc_format_group = { > .attrs = imc_format_attrs, > }; > > +/* Get the cpumask printed to a buffer "buf" */ > +static ssize_t imc_pmu_cpumask_get_attr(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + cpumask_t *active_mask; > + > + active_mask = &nest_imc_cpumask; > + return cpumap_print_to_pagebuf(true, buf, active_mask); > +} > + > +static DEVICE_
Re: [PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority
* Nicholas Piggin [2017-04-04 09:52:07]:

> On Tue, 4 Apr 2017 07:54:12 +1000
> Anton Blanchard wrote:
>
> > From: Anton Blanchard
> >
> > The core of snooze_loop() continually bounces between low and very
> > low thread priority. Changing thread priorities is an expensive
> > operation that can negatively impact other threads on a core.
> >
> > All CPUs that can run PowerNV support very low priority, so we can
> > avoid the change completely.
>
> This looks good. I have HMT_lowest() which does alt feature patching
> we can use for pseries and default idle code.

Alternatively, if we are going to set the priority only once, then in
various other places HMT_low(); HMT_very_low(); should not add extra
cycles. Let me code that up.

--Vaidy
Re: [RFC][PATCH] spin loop arch primitives for busy waiting
On Tue, 4 Apr 2017 13:02:33 +1000
Nicholas Piggin wrote:

> On Mon, 3 Apr 2017 17:43:05 -0700
> Linus Torvalds wrote:
>
> > But that depends on architectures having some pattern that we *can*
> > abstract. Would some "begin/in-loop/end" pattern like the above be
> > sufficient?
>
> Yes. begin/in/end would be sufficient for powerpc SMT priority, and
> for x86, and it looks like sparc64 too. So we could do that if you
> prefer.

How's this? I changed your name a bit just so we have a common spin_
prefix. With example powerpc implementation and one caller converted to
see the effect.

---
 arch/powerpc/include/asm/processor.h | 17 ++++++++++
 include/linux/processor.h            | 48 ++++++++++++++++++++++++++++
 kernel/sched/idle.c                  |  7 +++-
 3 files changed, 71 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/processor.h

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index e9bbd450d966..1274dc818e74 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -402,6 +402,23 @@ static inline unsigned long __pack_fe01(unsigned int fpmode)
 
 #ifdef CONFIG_PPC64
 #define cpu_relax()	do { HMT_low(); HMT_medium(); barrier(); } while (0)
+
+#ifndef spin_begin
+#define spin_begin()	HMT_low()
+#endif
+
+#ifndef spin_cpu_relax
+#define spin_cpu_relax()	barrier()
+#endif
+
+#ifndef spin_cpu_yield
+#define spin_cpu_yield()
+#endif
+
+#ifndef spin_end
+#define spin_end()	HMT_medium()
+#endif
+
 #else
 #define cpu_relax()	barrier()
 #endif
diff --git a/include/linux/processor.h b/include/linux/processor.h
new file mode 100644
index 000000000000..65e5635d0069
--- /dev/null
+++ b/include/linux/processor.h
@@ -0,0 +1,48 @@
+/* Misc low level processor primitives */
+#ifndef _LINUX_PROCESSOR_H
+#define _LINUX_PROCESSOR_H
+
+#include <asm/processor.h>
+
+/*
+ * spin_begin is used before beginning a busy-wait loop, and must be paired
+ * with spin_end when the loop is exited. spin_cpu_relax must be called
+ * within the loop.
+ *
+ * The loop body should be as small and fast as possible, on the order of
+ * tens of instructions/cycles as a guide. It should avoid calling
+ * cpu_relax, or any "spin" or sleep type of primitive, including nested
+ * uses of these primitives. It should not lock or take any other resource.
+ * Violations of this will not cause a bug, but may cause sub-optimal
+ * performance.
+ *
+ * These loops are optimized to be used where wait times are expected to be
+ * less than the cost of a context switch (and associated overhead).
+ *
+ * Detection of resource owner and decision to spin or sleep or guest-yield
+ * (e.g., spin lock holder vcpu preempted, or mutex owner not on CPU) can be
+ * tested within the busy loop body if necessary.
+ */
+#ifndef spin_begin
+#define spin_begin()
+#endif
+
+#ifndef spin_cpu_relax
+#define spin_cpu_relax()	cpu_relax()
+#endif
+
+/*
+ * spin_cpu_yield may be called to yield (undirected) to the hypervisor if
+ * necessary. This should be used if the wait is expected to take longer
+ * than context switch overhead, but we can't sleep or do a directed yield.
+ */
+#ifndef spin_cpu_yield
+#define spin_cpu_yield()	cpu_relax_yield()
+#endif
+
+#ifndef spin_end
+#define spin_end()
+#endif
+
+#endif /* _LINUX_PROCESSOR_H */
+
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ac6d5176463d..99a032d9f4a9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -10,6 +10,7 @@
 #include <linux/cpu.h>
 #include <linux/cpuidle.h>
 #include <linux/cpuhotplug.h>
+#include <linux/processor.h>
 #include <linux/tick.h>
 
 #include <asm/tlb.h>
@@ -63,9 +64,13 @@ static noinline int __cpuidle cpu_idle_poll(void)
 	trace_cpu_idle_rcuidle(0, smp_processor_id());
 	local_irq_enable();
 	stop_critical_timings();
+
+	spin_begin();
 	while (!tif_need_resched() &&
 		(cpu_idle_force_poll || tick_check_broadcast_expired()))
-		cpu_relax();
+		spin_cpu_relax();
+	spin_end();
+
 	start_critical_timings();
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 	rcu_idle_exit();
-- 
2.11.0
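A self-contained user-space sketch of the intended usage pattern, using the generic (empty) fallbacks; the simulated writer stands in for another CPU so the loop terminates. On powerpc, spin_begin/spin_end would drop and restore SMT priority instead of expanding to nothing:

```c
#include <assert.h>

/* Generic no-op fallbacks, as in the proposed linux/processor.h. */
#define spin_begin()
#define spin_cpu_relax()	do { } while (0)
#define spin_end()

/*
 * Busy-wait until *flag becomes nonzero, following the
 * begin/in-loop/end pattern.  The countdown simulates the writer so
 * this sketch runs single-threaded; returns the number of iterations
 * spent spinning.
 */
static int wait_for_flag(volatile int *flag, int countdown)
{
	int iters = 0;

	spin_begin();
	while (!*flag) {
		spin_cpu_relax();
		if (++iters >= countdown)
			*flag = 1;	/* simulate the writer */
	}
	spin_end();
	return iters;
}
```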
Re: [PATCH 3/3] cpuidle: powernv: Avoid a branch in the core snooze_loop() loop
* Anton Blanchard [2017-04-04 07:54:14]:

> From: Anton Blanchard
>
> When in the snooze_loop() we want to take up the least amount of
> resources. On my version of gcc (6.3), we end up with an extra
> branch because it predicts snooze_timeout_en to be false, whereas it
> is almost always true.

By default snooze_timeout_en is true. It will become false only when we
do not want to exit the snooze loop, and that will be when snooze is the
only idle state available in the platform, which is a rare case.

> Use likely() to avoid the branch and be a little nicer to the
> other non idle threads on the core.
>
> Signed-off-by: Anton Blanchard

Reviewed-by: Vaidyanathan Srinivasan

> ---
>  drivers/cpuidle/cpuidle-powernv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index 8c991c254b95..251a60bfa8ee 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -58,7 +58,7 @@ static int snooze_loop(struct cpuidle_device *dev,
>  	ppc64_runlatch_off();
>  	HMT_very_low();
>  	while (!need_resched()) {
> -		if (snooze_timeout_en && get_tb() > snooze_exit_time)
> +		if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time)
>  			break;
>  	}
>
> --
> 2.11.0
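For readers unfamiliar with likely(): it expands to __builtin_expect() and only hints the compiler's branch layout; it never changes the value of the expression. A small sketch (the helper name is made up, and the timebase values are plain integers here):

```c
#include <assert.h>

/* The kernel's definition: a prediction hint, semantically a no-op. */
#define likely(x)	__builtin_expect(!!(x), 1)

/* Same shape as the condition in snooze_loop(). */
static int snooze_should_exit(int timeout_en, unsigned long tb,
			      unsigned long exit_time)
{
	return likely(timeout_en) && tb > exit_time;
}
```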
Re: [PATCH 2/3] cpuidle: powernv: Don't continually set thread priority in snooze_loop()
* Anton Blanchard [2017-04-04 07:54:13]:

> From: Anton Blanchard
>
> The powerpc64 kernel exception handlers have preserved thread priorities
> for a long time now, so there is no need to continually set it.
>
> Just set it once on entry and once on exit.
>
> Signed-off-by: Anton Blanchard

Reviewed-by: Vaidyanathan Srinivasan

> ---
>  drivers/cpuidle/cpuidle-powernv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index 9d9f164894eb..8c991c254b95 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -56,8 +56,8 @@ static int snooze_loop(struct cpuidle_device *dev,
>
>  	snooze_exit_time = get_tb() + snooze_timeout;
>  	ppc64_runlatch_off();
> +	HMT_very_low();
>  	while (!need_resched()) {
> -		HMT_very_low();
>  		if (snooze_timeout_en && get_tb() > snooze_exit_time)
>  			break;
>  	}
> --
> 2.11.0
Re: [PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority
* Anton Blanchard [2017-04-04 07:54:12]:

> From: Anton Blanchard
>
> The core of snooze_loop() continually bounces between low and very
> low thread priority. Changing thread priorities is an expensive
> operation that can negatively impact other threads on a core.
>
> All CPUs that can run PowerNV support very low priority, so we can
> avoid the change completely.
>
> Signed-off-by: Anton Blanchard

Reviewed-by: Vaidyanathan Srinivasan

> ---
>  drivers/cpuidle/cpuidle-powernv.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index cda8f62d555b..9d9f164894eb 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -57,7 +57,6 @@ static int snooze_loop(struct cpuidle_device *dev,
>  	snooze_exit_time = get_tb() + snooze_timeout;
>  	ppc64_runlatch_off();
>  	while (!need_resched()) {
> -		HMT_low();
>  		HMT_very_low();
>  		if (snooze_timeout_en && get_tb() > snooze_exit_time)
>  			break;

HMT_low() is legacy and can be removed for powernv platforms.

--Vaidy
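The HMT_*() macros under discussion are SMT priority hints encoded as special "or Rx,Rx,Rx" no-ops in the Power ISA. A hedged sketch of the pattern after this series, compiled to nothing on non-powerpc so it stays self-contained (the polling body is elided):

```c
#include <assert.h>

/*
 * SMT priority nops from the Power ISA; harmless register-to-itself
 * "or" forms that only affect thread priority.
 */
#if defined(__powerpc64__)
#define HMT_very_low()	asm volatile("or 31,31,31")	/* very low priority */
#define HMT_low()	asm volatile("or 1,1,1")	/* low priority */
#define HMT_medium()	asm volatile("or 2,2,2")	/* normal priority */
#else
#define HMT_very_low()	do { } while (0)
#define HMT_low()	do { } while (0)
#define HMT_medium()	do { } while (0)
#endif

/* Shape of the loop after the series: drop priority once on entry,
 * restore once on exit, no priority changes inside the loop. */
static int snooze_once(void)
{
	HMT_very_low();
	/* ... poll need_resched() / snooze_exit_time here ... */
	HMT_medium();
	return 0;
}
```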
Re: [PATCH v6 05/11] powerpc/perf: Generic imc pmu event functions
Hi,

So a major complaint I have is that you're changing prototypes of
functions from earlier patches. This makes my life a lot harder: I get
my head around what a function is and does, and then suddenly the
prototype changes, the behaviour changes, and I have to re-evaluate
everything I thought I knew about the function. The diffs are also
usually quite unhelpful.

It would be far better, from my point of view, to squash related commits
together. You're adding a large-ish driver: we might as well review one
large, complete commit rather than several smaller incomplete commits.

There are a lot of interrelated benefits to this - just from the patches
I've reviewed so far, if there were fewer, larger commits then:

- I would only have to read a function once to get a full picture of
  what it does

- comments in function headers wouldn't get out of sync with function
  bodies

- there would be less movement of variables from file to file

- earlier comments wouldn't be invalidated by things I learn later.
  For example I suggested different names for imc_event_info_{str,val}
  based on their behaviour when first added in patch 3. Here you change
  the behaviour of imc_event_info_val - it would have been helpful to
  see the fuller picture when I was first reviewing as I would have been
  able to make more helpful suggestions.

and so on.

Anyway, further comments in line.

> From: Hemant Kumar
>
> Since the IMC counters' data are periodically fed to a memory location,
> the functions to read/update, start/stop, add/del can be generic and can
> be used by all IMC PMU units.
>
> This patch adds a set of generic imc pmu related event functions to be
> used by each imc pmu unit. Add code to setup format attribute and to
> register imc pmus. Add an event_init function for nest_imc events.
> > Signed-off-by: Anju T Sudhakar > Signed-off-by: Hemant Kumar > Signed-off-by: Madhavan Srinivasan > --- > arch/powerpc/include/asm/imc-pmu.h| 1 + > arch/powerpc/perf/imc-pmu.c | 137 > ++ > arch/powerpc/platforms/powernv/opal-imc.c | 30 ++- > 3 files changed, 164 insertions(+), 4 deletions(-) > > diff --git a/arch/powerpc/include/asm/imc-pmu.h > b/arch/powerpc/include/asm/imc-pmu.h > index a3d4f1bf9492..e13f51047522 100644 > --- a/arch/powerpc/include/asm/imc-pmu.h > +++ b/arch/powerpc/include/asm/imc-pmu.h > @@ -65,4 +65,5 @@ struct imc_pmu { > #define IMC_DOMAIN_NEST 1 > #define IMC_DOMAIN_UNKNOWN -1 > > +int imc_get_domain(struct device_node *pmu_dev); > #endif /* PPC_POWERNV_IMC_PMU_DEF_H */ > diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c > index ec464c76b749..bd5bf66bd920 100644 > --- a/arch/powerpc/perf/imc-pmu.c > +++ b/arch/powerpc/perf/imc-pmu.c > @@ -18,6 +18,132 @@ > struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; > struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; > > +/* Needed for sanity check */ > +extern u64 nest_max_offset; Should this be in a header file? 
> + > +PMU_FORMAT_ATTR(event, "config:0-20"); > +static struct attribute *imc_format_attrs[] = { > + &format_attr_event.attr, > + NULL, > +}; > + > +static struct attribute_group imc_format_group = { > + .name = "format", > + .attrs = imc_format_attrs, > +}; > + > +static int nest_imc_event_init(struct perf_event *event) > +{ > + int chip_id; > + u32 config = event->attr.config; > + struct perchip_nest_info *pcni; > + > + if (event->attr.type != event->pmu->type) > + return -ENOENT; > + > + /* Sampling not supported */ > + if (event->hw.sample_period) > + return -EINVAL; > + > + /* unsupported modes and filters */ > + if (event->attr.exclude_user || > + event->attr.exclude_kernel || > + event->attr.exclude_hv || > + event->attr.exclude_idle || > + event->attr.exclude_host || > + event->attr.exclude_guest) > + return -EINVAL; > + > + if (event->cpu < 0) > + return -EINVAL; > + > + /* Sanity check for config (event offset) */ > + if (config > nest_max_offset) > + return -EINVAL; config is a u32, nest_max_offset is a u64. Is there a reason for this? (Also, config is an unintuitive name for the local variable - the data is stored in the attribute config space but the value represents an offset into the reserved memory region.) > + > + chip_id = topology_physical_package_id(event->cpu); > + pcni = &nest_perchip_info[chip_id]; > + > + /* > + * Memory for Nest HW counter data could be in multiple pages. > + * Hence check and pick the right base page from the event offset, > + * and then add to it. > + */ > + event->hw.event_base = pcni->vbase[config/PAGE_SIZE] + > + (config & ~PAGE_MASK); > + > + return 0; > +} > + > +static void imc_read_counter(struct perf_event *even
Re: [RFC][PATCH] spin loop arch primitives for busy waiting
On Mon, 3 Apr 2017 17:43:05 -0700
Linus Torvalds wrote:

> On Mon, Apr 3, 2017 at 4:50 PM, Nicholas Piggin wrote:
> > If you have any ideas, I'd be open to them.
>
> So the idea would be that maybe we can just make those things
> explicit. IOW, instead of having that magical looping construct that
> does other magical hidden things as part of the loop, maybe we can
> just have a
>
>	begin_cpu_relax();
>	while (!cond)
>		cpu_relax();
>	end_cpu_relax();
>
> and then architectures can decide how they implement it. So for x86,
> the begin/end macros would be empty. For ppc, maybe begin/end would be
> the "lower and raise priority", while cpu_relax() itself is an empty
> thing.
>
> Or maybe "begin" just clears a counter, while "cpu_relax()" does some
> "increase iterations, and lower priority after X iterations", and then
> "end" raises the priority again.
>
> The "do magic having a special loop" approach disturbs me. I'd much
> rather have more explicit hooks that allow people to do their own loop
> semantics (including having a "return" to exit early).

I guess so. Users will still have to read the documentation rather than
just throw it in ad hoc because it seems like a good idea. For example,
powerpc must not use any other primitives that might change SMT priority
in the idle loop. For x86, if you do too much work, the rep ; nop
component becomes relatively small and you lose benefit (10 cycles
latency pre-Skylake).

I would suggest keeping standalone cpu_relax() for incidental code in
drivers and things (powerpc may add some extra nops between SMT low and
SMT normal priorities to improve it a little), and this would be used by
important core code and primitives.

> But that depends on architectures having some pattern that we *can*
> abstract. Would some "begin/in-loop/end" pattern like the above be
> sufficient?

Yes. begin/in/end would be sufficient for powerpc SMT priority, and for
x86, and it looks like sparc64 too. So we could do that if you prefer.
After this, I might look at optimizing the loop code itself (optimizing
exit condition, de-pipelining, etc). That would require spin loop
primitives like this again. BUT they would not have funny restrictions
on exit conditions because they wouldn't do SMT priority. If I get
positive numbers for that, would you be opposed to having such
primitives (just for the important core spin loops)?

> I think s390 might have issues too, since they tried to have that
> "cpu_relax_yield" thing (which is only used by stop_machine), and
> they've tried cpu_relax_lowlatency() and other games.

There are some considerations with hv/guest yielding, yes. I'm not
touching that yet, but it needs to be looked at. Generic code has no
chance of looking at primitives available and deciding which should be
best used (not least because they aren't documented).

I think for the most part, busy loops shouldn't be done if they tend to
be more expensive than a context switch + associated cache costs, so
hypervisors are okay there. It's just rare cases where we *have* to
spin. Directed spinning or yielding for a resource is possibly a more
general pattern and something to look at adding to these spin APIs, but
for now they should be okay to just remain within the loop body.

Summary: hypervisor guests should not be affected by the new APIs, but
we could look at augmenting them later with some hv hints.

Thanks,
Nick
Re: [PATCH v6 04/11] powerpc/perf: Add event attribute and group to IMC pmus
Hi, > Device tree IMC driver code parses the IMC units and their events. It > passes the information to IMC pmu code which is placed in powerpc/perf > as "imc-pmu.c". > > This patch creates only event attributes and attribute groups for the > IMC pmus. > > Signed-off-by: Anju T Sudhakar > Signed-off-by: Hemant Kumar > Signed-off-by: Madhavan Srinivasan > --- > arch/powerpc/perf/Makefile| 6 +- > arch/powerpc/perf/imc-pmu.c | 97 > +++ > arch/powerpc/platforms/powernv/opal-imc.c | 12 +++- > 3 files changed, 112 insertions(+), 3 deletions(-) > create mode 100644 arch/powerpc/perf/imc-pmu.c > > diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile > index 4d606b99a5cb..d0d1f04203c7 100644 > --- a/arch/powerpc/perf/Makefile > +++ b/arch/powerpc/perf/Makefile > @@ -2,10 +2,14 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror > > obj-$(CONFIG_PERF_EVENTS)+= callchain.o perf_regs.o > > +imc-$(CONFIG_PPC_POWERNV) += imc-pmu.o > + > obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o bhrb.o > obj64-$(CONFIG_PPC_PERF_CTRS)+= power4-pmu.o ppc970-pmu.o > power5-pmu.o \ > power5+-pmu.o power6-pmu.o power7-pmu.o \ > -isa207-common.o power8-pmu.o power9-pmu.o > +isa207-common.o power8-pmu.o power9-pmu.o \ > +$(imc-y) Hmm, this seems... obtuse. Will the IMC stuff fail to compile on non-powerNV? Can we just link it in like we do with every other sort of performance counter and let runtime detection do its thing? > + > obj32-$(CONFIG_PPC_PERF_CTRS)+= mpc7450-pmu.o > > obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o > diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c > new file mode 100644 > index ..ec464c76b749 > --- /dev/null > +++ b/arch/powerpc/perf/imc-pmu.c > @@ -0,0 +1,97 @@ > +/* > + * Nest Performance Monitor counter support. > + * > + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation. > + *(C) 2016 Hemant K Shaw, IBM Corporation. 
> + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2 as published > + * by the Free Software Foundation. > + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; > +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; > + > +/* dev_str_attr : Populate event "name" and string "str" in attribute */ > +static struct attribute *dev_str_attr(const char *name, const char *str) > +{ > + struct perf_pmu_events_attr *attr; > + > + attr = kzalloc(sizeof(*attr), GFP_KERNEL); > + > + sysfs_attr_init(&attr->attr.attr); > + > + attr->event_str = str; > + attr->attr.attr.name = name; > + attr->attr.attr.mode = 0444; > + attr->attr.show = perf_event_sysfs_show; > + > + return &attr->attr.attr; > +} > + > +/* > + * update_events_in_group: Update the "events" information in an attr_group > + * and assign the attr_group to the pmu "pmu". > + */ > +static int update_events_in_group(struct imc_events *events, > + int idx, struct imc_pmu *pmu) So I've probably missed a key point in the earlier patches, but for the life of me I cannot figure out what idx is supposed to represent. > +{ > + struct attribute_group *attr_group; > + struct attribute **attrs; > + int i; > + > + /* Allocate memory for attribute group */ > + attr_group = kzalloc(sizeof(*attr_group), GFP_KERNEL); > + if (!attr_group) > + return -ENOMEM; > + > + /* Allocate memory for attributes */ > + attrs = kzalloc((sizeof(struct attribute *) * (idx + 1)), GFP_KERNEL); Also, idx is an int, so: - things will go wrong if idx is -1 - things will go very wrong if idx is -2 - do you need to do some sort of validation to make sure it's within bounds? If not in this function, what function ensures the necessary 'sanity check' preconditions for idx? 
> + if (!attrs) { > + kfree(attr_group); > + return -ENOMEM; > + } > + > + attr_group->name = "events"; > + attr_group->attrs = attrs; > + for (i = 0; i < idx; i++, events++) { > + attrs[i] = dev_str_attr((char *)events->ev_name, > + (char *)events->ev_value); > + } > + > + pmu->attr_groups[0] = attr_group; Again, I may have missed something, but what's special about group 0? Patch 1 lets you have up to 4 groups - are they used elsewhere? > + return 0; > +} > + > +/* > + * init_imc_pmu : Setup the IMC pmu device in "pmu_ptr" and its events > + *"events". > + * Setup the cpu mask information for these pmus and setup the state machine > + * hotplug notifiers as well. I cannot
Re: [PATCH v6 01/11] powerpc/powernv: Data structure and macros definitions
Hi,

> +#define IMC_MAX_CHIPS		32
> +#define IMC_MAX_PMUS		32
> +#define IMC_MAX_PMU_NAME_LEN	256

I've noticed this is used as both the maximum length for event names and event value strings. Would another name suit better?

> +
> +#define IMC_NEST_MAX_PAGES	16
> +
> +#define IMC_DTB_COMPAT		"ibm,opal-in-memory-counters"
> +#define IMC_DTB_NEST_COMPAT	"ibm,imc-counters-nest"
> +
> +/*
> + * Structure to hold per chip specific memory address
> + * information for nest pmus. Nest Counter data are exported
> + * in per-chip reserved memory region by the PORE Engine.
> + */
> +struct perchip_nest_info {
> +	u32 chip_id;
> +	u64 pbase;
> +	u64 vbase[IMC_NEST_MAX_PAGES];
> +	u64 size;
> +};
> +
> +/*
> + * Place holder for nest pmu events and values.
> + */
> +struct imc_events {
> +	char *ev_name;
> +	char *ev_value;
> +};
> +
> +/*
> + * Device tree parser code detects IMC pmu support and
> + * registers new IMC pmus. This structure will
> + * hold the pmu functions and attrs for each imc pmu and
> + * will be referenced at the time of pmu registration.
> + */
> +struct imc_pmu {
> +	struct pmu pmu;
> +	int domain;
> +	const struct attribute_group *attr_groups[4];
> +};
> +
> +/*
> + * Domains for IMC PMUs
> + */
> +#define IMC_DOMAIN_NEST		1
> +#define IMC_DOMAIN_UNKNOWN	-1
> +
> +#endif /* PPC_POWERNV_IMC_PMU_DEF_H */
> --
> 2.7.4
Re: [PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module
Hi, > + do { > + pages = PAGE_SIZE * i; > + pcni->vbase[i++] = (u64)phys_to_virt(pcni->pbase + > + pages); > + } while (i < (pcni->size / PAGE_SIZE)); I also just noticed that there's no check here against IMC_NEST_MAX_PAGES: should that be inserted? (If for no other reason than to stop every static analysis tool complaining!) Daniel > + } > + > + return 0; > +err: > + return -ENODEV; > +} > + > +static const struct of_device_id opal_imc_match[] = { > + { .compatible = IMC_DTB_COMPAT }, > + {}, > +}; > + > +static struct platform_driver opal_imc_driver = { > + .driver = { > + .name = "opal-imc-counters", > + .of_match_table = opal_imc_match, > + }, > + .probe = opal_imc_counters_probe, > +}; > + > +MODULE_DEVICE_TABLE(of, opal_imc_match); > +module_platform_driver(opal_imc_driver); > +MODULE_DESCRIPTION("PowerNV OPAL IMC driver"); > +MODULE_LICENSE("GPL"); > diff --git a/arch/powerpc/platforms/powernv/opal.c > b/arch/powerpc/platforms/powernv/opal.c > index e0f856bfbfe8..85ea1296f030 100644 > --- a/arch/powerpc/platforms/powernv/opal.c > +++ b/arch/powerpc/platforms/powernv/opal.c > @@ -14,6 +14,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -30,6 +31,7 @@ > #include > #include > #include > +#include > > #include "powernv.h" > > @@ -631,6 +633,15 @@ static void opal_pdev_init(const char *compatible) > of_platform_device_create(np, NULL, NULL); > } > > +static void opal_imc_init_dev(void) > +{ > + struct device_node *np; > + > + np = of_find_compatible_node(NULL, NULL, IMC_DTB_COMPAT); > + if (np) > + of_platform_device_create(np, NULL, NULL); > +} > + > static int kopald(void *unused) > { > unsigned long timeout = msecs_to_jiffies(opal_heartbeat) + 1; > @@ -704,6 +715,9 @@ static int __init opal_init(void) > /* Setup a heatbeat thread if requested by OPAL */ > opal_init_heartbeat(); > > + /* Detect IMC pmu counters support and create PMUs */ > + opal_imc_init_dev(); > + > /* Create leds platform devices */ > leds = 
of_find_node_by_path("/ibm,opal/leds"); > if (leds) { > -- > 2.7.4
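Daniel's concern above — the `vbase[]` mapping loop never checking against IMC_NEST_MAX_PAGES — could be addressed along these lines. This is only a hedged sketch: `phys_to_virt()` is stubbed so it compiles in userspace, and returning -ENODEV on overflow is an assumed policy, not something the patch specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096UL
#endif
#define IMC_NEST_MAX_PAGES 16

struct perchip_nest_info {
	uint64_t pbase;
	uint64_t vbase[IMC_NEST_MAX_PAGES];
	uint64_t size;
};

/* Stand-in for the kernel's phys_to_virt() so the sketch is self-contained. */
static uint64_t phys_to_virt_stub(uint64_t phys)
{
	return phys + 0xc000000000000000ULL;
}

/*
 * Map each page of the reserved region, refusing to run past the fixed
 * vbase[] array -- the bounds check Daniel asked about.
 */
static int imc_map_nest_pages(struct perchip_nest_info *pcni)
{
	uint64_t nr_pages = pcni->size / PAGE_SIZE;
	uint64_t i;

	if (nr_pages > IMC_NEST_MAX_PAGES)
		return -ENODEV;

	for (i = 0; i < nr_pages; i++)
		pcni->vbase[i] = phys_to_virt_stub(pcni->pbase + i * PAGE_SIZE);

	return 0;
}
```

Written as a for loop (as suggested later in the thread) the exit condition and the bound are visible in one place, which also quiets static analyzers.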
Re: [PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and its events
Madhavan Srinivasan writes: > From: Hemant Kumar > > Parse device tree to detect IMC units. Traverse through each IMC unit > node to find supported events and corresponding unit/scale files (if any). > > Here is the DTS file for reference: > > > https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts > > The device tree for IMC counters starts at the node "imc-counters". > This node contains all the IMC PMU nodes and event nodes > for these IMC PMUs. The PMU nodes have an "events" property which has a > phandle value for the actual events node. The events are separated from > the PMU nodes to abstract out the common events. For example, PMU node > "mcs0", "mcs1" etc. will contain a pointer to "nest-mcs-events" since, > the events are common between these PMUs. These events have a different > prefix based on their relation to different PMUs, and hence, the PMU > nodes themselves contain an "events-prefix" property. The value for this > property concatenated to the event name, forms the actual event > name. Also, the PMU have a "reg" field as the base offset for the events > which belong to this PMU. This "reg" field is added to an event in the > "events" node, which gives us the location of the counter data. Kernel > code uses this offset as event configuration value. This is probably because I haven't looked at this area recently, but what do you mean by 'event configuration value'? If I understand correctly, you're saying something like "kernel code stores this calculated location in the event configuration data" - is that about right? Apologies if this is a dumb question. > > Device tree parser code also looks for scale/unit property in the event > node and passes on the value as an event attr for perf interface to use > in the post processing by the perf tool. Some PMUs may have common scale > and unit properties which implies that all events supported by this PMU > inherit the scale and unit properties of the PMU itself. 
For those > events, we need to set the common unit and scale values. > > For failure to initialize any unit or any event, disable that unit and > continue setting up the rest of them. > > Signed-off-by: Hemant Kumar > Signed-off-by: Anju T Sudhakar > Signed-off-by: Madhavan Srinivasan > --- > arch/powerpc/platforms/powernv/opal-imc.c | 386 > ++ > 1 file changed, 386 insertions(+) > > diff --git a/arch/powerpc/platforms/powernv/opal-imc.c > b/arch/powerpc/platforms/powernv/opal-imc.c > index c476d596c6a8..ba0203e669c0 100644 > --- a/arch/powerpc/platforms/powernv/opal-imc.c > +++ b/arch/powerpc/platforms/powernv/opal-imc.c > @@ -33,6 +33,388 @@ > #include > > struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; > +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; > + > +static int imc_event_info(char *name, struct imc_events *events) This seems to be an allocation/init function: should that be reflected in the name? > +{ > + char *buf; > + > + /* memory for content */ > + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL); > + if (!buf) > + return -ENOMEM; > + > + events->ev_name = name; We trust the name will never become a dangling pointer? We don't need to duplicate it? Should the pointer be const? > + events->ev_value = buf; > + return 0; > +} > + > +static int imc_event_info_str(struct property *pp, char *name, > +struct imc_events *events) This creates and sets an initial value based on a property, ... > +static int imc_event_info_val(char *name, u32 val, > + struct imc_events *events) ... this creates and sets based on a u32. Could you call them something that suggests their purpose a bit better? Maybe event_info_from_{property,val}? 
> +{ > + int ret; > + > + ret = imc_event_info(name, events); > + if (ret) > + return ret; > + sprintf(events->ev_value, "event=0x%x", val); > + > + return 0; > +} > + > +static int set_event_property(struct property *pp, char *event_prop, > + struct imc_events *events, char *ev_name) > +{ > + char *buf; > + int ret; > + > + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL); > + if (!buf) > + return -ENOMEM; > + > + sprintf(buf, "%s.%s", ev_name, event_prop); > + ret = imc_event_info_str(pp, buf, events); > + if (ret) { > + kfree(events->ev_name); > + kfree(events->ev_value); How do you know the buffer for ev_value was successfully allocated? Can you be sure it is safe to free? (Consider imc_event_info returning -ENOMEM) > + } > + > + return ret; > +} > + > +/* > + * imc_events_node_parser: Parse the event node "dev" and assign the parsed > + * information to event "events". > + * > + * Parses the "reg" property of this event. "reg" gives us the event offset. > + * Also, parse the "scale" and "unit" properties, if any. > + */ > +s
Re: [PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module
Hi all, I'm trying to get my head around these patches - at this point I'm just doing a first pass, so I may have more substantive structural comments later on. In the mean time - here are some minor C nits: > + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation. > + * (C) 2016 Hemant K Shaw, IBM Corporation. Should these be bumped to 2017? > + > + do { > + pages = PAGE_SIZE * i; > + pcni->vbase[i++] = (u64)phys_to_virt(pcni->pbase + > + pages); > + } while (i < (pcni->size / PAGE_SIZE)); > + } I had to scroll back up to the top of this function to make sure I understood what this loop does. Would it be better to write it as: for (i = 0; i < (pcni->size / PAGE_SIZE); i++) { pages = PAGE_SIZE * i; pcni->vbase[i] = (u64) } And, just checking - this is expected to work on both 4 and 64kB pages? > + > + return 0; > +err: > + return -ENODEV; You're not releasing any resources here - would it be better to just replace the gotos with this return? I haven't checked to see if you change the function later on to allocate memory - if so please ignore :) > +} > + > +static const struct of_device_id opal_imc_match[] = { > + { .compatible = IMC_DTB_COMPAT }, > + {}, > +}; > + > +static struct platform_driver opal_imc_driver = { > + .driver = { > + .name = "opal-imc-counters", > + .of_match_table = opal_imc_match, > + }, > + .probe = opal_imc_counters_probe, > +}; > + > +MODULE_DEVICE_TABLE(of, opal_imc_match); > +module_platform_driver(opal_imc_driver); > +MODULE_DESCRIPTION("PowerNV OPAL IMC driver"); > +MODULE_LICENSE("GPL"); > diff --git a/arch/powerpc/platforms/powernv/opal.c > b/arch/powerpc/platforms/powernv/opal.c > index e0f856bfbfe8..85ea1296f030 100644 > --- a/arch/powerpc/platforms/powernv/opal.c > +++ b/arch/powerpc/platforms/powernv/opal.c > @@ -14,6 +14,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -30,6 +31,7 @@ > #include > #include > #include > +#include > > #include "powernv.h" > > @@ -631,6 +633,15 @@ static 
void opal_pdev_init(const char *compatible) > of_platform_device_create(np, NULL, NULL); > } > > +static void opal_imc_init_dev(void) > +{ > + struct device_node *np; > + > + np = of_find_compatible_node(NULL, NULL, IMC_DTB_COMPAT); > + if (np) > + of_platform_device_create(np, NULL, NULL); > +} Should this function be tagged __init? > + > static int kopald(void *unused) > { > unsigned long timeout = msecs_to_jiffies(opal_heartbeat) + 1; > @@ -704,6 +715,9 @@ static int __init opal_init(void) > /* Setup a heatbeat thread if requested by OPAL */ > opal_init_heartbeat(); > > + /* Detect IMC pmu counters support and create PMUs */ > + opal_imc_init_dev(); > + > /* Create leds platform devices */ > leds = of_find_node_by_path("/ibm,opal/leds"); > if (leds) { > -- > 2.7.4
Re: [RFC][PATCH] spin loop arch primitives for busy waiting
On Mon, Apr 3, 2017 at 4:50 PM, Nicholas Piggin wrote:
>
> POWER does not have an instruction like pause. We can only set current
> thread priority, and current implementations do something like allocate
> issue cycles to threads based on relative priorities. So there should
> be at least one or two issue cycles at low priority, but ideally we
> would not be changing priority in the busy-wait loop because it can
> impact other threads in the core.
>
> I couldn't think of a good way to improve cpu_relax. Our (open source)
> firmware has a cpu_relax, and it puts a bunch of nops between low and
> normal priority instructions so we get some fetch cycles at low prio.
> That isn't ideal though.
>
> If you have any ideas, I'd be open to them.

So the idea would be that maybe we can just make those things explicit.

IOW, instead of having that magical looping construct that does other magical hidden things as part of the loop, maybe we can just have a

    begin_cpu_relax();
    while (!cond)
        cpu_relax();
    end_cpu_relax();

and then architectures can decide how they implement it.

So for x86, the begin/end macros would be empty. For ppc, maybe begin/end would be the "lower and raise priority", while cpu_relax() itself is an empty thing. Or maybe "begin" just clears a counter, while "cpu_relax()" does some "increase iterations, and lower priority after X iterations", and then "end" raises the priority again.

The "do magic having a special loop" approach disturbs me. I'd much rather have more explicit hooks that allow people to do their own loop semantics (including having a "return" to exit early). But that depends on architectures having some pattern that we *can* abstract.

Would some "begin/in-loop/end" pattern like the above be sufficient? The pure "in-loop" case we have now (ie "cpu_relax()") clearly isn't sufficient.
I think s390 might have issues too, since they tried to have that "cpu_relax_yield" thing (which is only used by stop_machine), and they've tried cpu_relax_lowlatency() and other games. Linus
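For readers following along, the begin/in-loop/end shape Linus sketches could look like this in C. This is only an illustration: begin_cpu_relax()/end_cpu_relax() are his hypothetical names, not an existing kernel API, and the empty macro bodies mirror what an x86-style implementation might be.

```c
#include <assert.h>

/*
 * Hypothetical per-arch hooks: empty on architectures where a "pause"
 * inside the loop is enough; an arch like POWER could lower SMT thread
 * priority in begin_cpu_relax() and restore it in end_cpu_relax().
 */
#define begin_cpu_relax()  do { } while (0)
#define cpu_relax()        do { } while (0)
#define end_cpu_relax()    do { } while (0)

/* Busy-wait until *flag becomes nonzero, using the proposed pattern. */
static int wait_for_flag(volatile int *flag)
{
	long spins = 0;

	begin_cpu_relax();
	while (!*flag) {
		cpu_relax();
		if (++spins > 1000000)	/* demo-only bail-out */
			break;
	}
	end_cpu_relax();

	return *flag;
}
```

The point of the split is that any one-time setup/teardown cost (priority change, counter reset) stays outside the loop body, so a "break" or early exit needs no special handling.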
Re: [RFC PATCH 4/5] dt-bindings: soc/fsl: Update reserved memory binding for QBMan
On Mon, 2017-04-03 at 19:49 +, Roy Pledge wrote: > On 4/3/2017 11:42 AM, Rob Herring wrote: > > > > On Wed, Mar 29, 2017 at 05:13:56PM -0400, Roy Pledge wrote: > > > > > > Updates the QMan and BMan device tree bindings for reserved memory > > > nodes. This makes the reserved memory allocation compatiable with > > s/compatiable/compatible/ > > > > > > > > the shared-dma-pool usage. > > This change is not backwards compatible. Please state that and explain > > why that is okay. If PPC needs to not change, then the old strings and > > properties should remain, but deprecated. > I think I can make the old device trees compatible since the > "compatible" string changed without too much effort or ifdefery in the > code. However I would like to eventually see all PPC users move to the > new mode as I do believe it is more in alignment with the spirit of the > reserved-memory framework that is in the kernel. How much benefit is there to changing PPC if you have to retain the old method anyway for compatibility? Whereas if you don't convert, you retain test coverage for the old device trees, and don't have to worry about the memblock_end_of_DRAM() questions. > Do I need to mention the old mode in the binding.txt file? Yes. > I'm trying to keep things clean > so someone reading the binding doesn't get a headache and trying to > preserve my own sanity as we move forward with adding new features to > this driver. If they're sufficiently different then just describe the new and old nodes separately rather than putting a bunch of if/else clauses in there. -Scott
Re: [RFC PATCH 2/5] soc/fsl/qbman: Use shared-dma-pool for QMan private memory allocations
On Mon, 2017-04-03 at 15:52 +0100, Robin Murphy wrote: > On 01/04/17 08:25, Scott Wood wrote: > > > > On Fri, 2017-03-31 at 18:55 +0100, Robin Murphy wrote: > > > > > > On 31/03/17 04:27, Michael Ellerman wrote: > > > > > > > > > > > > Robin Murphy writes: > > > > > > > > > > > > > > > > > > > Hi Roy, > > > > > > > > > > On 29/03/17 22:13, Roy Pledge wrote: > > > > > > > > > > > > > > > > > > Use the shared-memory-pool mechanism for frame queue descriptor > > > > > > and > > > > > > packed frame descriptor record area allocations. > > > > > Thanks for persevering with this - in my opinion it's now looking > > > > > like > > > > > it was worth the effort :) > > > > > > > > > > AFAICS the ioremap_wc() that this leads to does appear to give back > > > > > something non-cacheable on PPC (assuming "pgprot_noncached_wc" isn't > > > > > horrendously misnamed), and "no-map" should rule out any cacheable > > > > > linear map alias existing, so it would seem that this approach > > > > > should > > > > > avert Scott's concerns about attribute mismatches. > > > > How does 'no-map' translate into something being excluded from the > > > > linear mapping? > > > Reserved regions marked with "no-map" get memblock_remove()d by > > > early_init_dt_alloc_reserved_memory_arch(). As I understand things, the > > > linear map should only cover memblock areas, and it would be explicitly > > > violating the semantics of "no-map" to still cover such a region. > > Discontiguous memory isn't supported on these PPC chips. Everything up to > > memblock_end_of_DRAM() gets mapped -- and if that were to change, the > > fragmentation would waste TLB1 entries. > Ah, so the "PPC-specific angles I'm not aware of" category is indeed > non-empty - I guess the lack of HAVE_GENERIC_DMA_COHERENT might be > related, then. 
> > That said, though, AFAICS only certain x86 and s390 configurations ever > call memblock_set_bottom_up(true), so we should be able to assume that > the reserved region allocations always fall through to > __memblock_find_range_top_down(). Thus if your DRAM is contiguous, then > "no-map"-ing the reserved regions will simply end up pushing > memblock_end_of_DRAM() down in a manner that would appear to still avoid > overlaps. Can you guarantee it will be at the end? What if there were other early memblock allocations (e.g. other reserved-memory regions without no-map) that came first? > I can only see that going wrong if the end of DRAM wasn't at > least 32MB aligned to begin with - is that ever likely to happen in > practice? Probably not, unless the user asks for an unusual memory size via mem= or similar. -Scott
Re: [PATCH 3/3] cpuidle: powernv: Avoid a branch in the core snooze_loop() loop
On Tue, 4 Apr 2017 07:54:14 +1000 Anton Blanchard wrote: > From: Anton Blanchard > > When in the snooze_loop() we want to take up the least amount of > resources. On my version of gcc (6.3), we end up with an extra > branch because it predicts snooze_timeout_en to be false, whereas it > is almost always true. > > Use likely() to avoid the branch and be a little nicer to the > other non idle threads on the core. Patches 2 and 3 look fine. Should they be replicated to cpuidle-pseries.c as well? Thanks, Nick > > Signed-off-by: Anton Blanchard > --- > drivers/cpuidle/cpuidle-powernv.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/cpuidle/cpuidle-powernv.c > b/drivers/cpuidle/cpuidle-powernv.c > index 8c991c254b95..251a60bfa8ee 100644 > --- a/drivers/cpuidle/cpuidle-powernv.c > +++ b/drivers/cpuidle/cpuidle-powernv.c > @@ -58,7 +58,7 @@ static int snooze_loop(struct cpuidle_device *dev, > ppc64_runlatch_off(); > HMT_very_low(); > while (!need_resched()) { > - if (snooze_timeout_en && get_tb() > snooze_exit_time) > + if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time) > break; > } >
Re: [PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority
On Tue, 4 Apr 2017 07:54:12 +1000 Anton Blanchard wrote: > From: Anton Blanchard > > The core of snooze_loop() continually bounces between low and very > low thread priority. Changing thread priorities is an expensive > operation that can negatively impact other threads on a core. > > All CPUs that can run PowerNV support very low priority, so we can > avoid the change completely. This looks good. I have HMT_lowest() which does alt feature patching we can use for pseries and default idle code.
Re: [RFC][PATCH] spin loop arch primitives for busy waiting
On Mon, 3 Apr 2017 08:31:30 -0700
Linus Torvalds wrote:

> On Mon, Apr 3, 2017 at 1:13 AM, Nicholas Piggin wrote:
> >
> > The loops have some restrictions on what can be used, but they are
> > intended to be small and simple so it's not generally a problem:
> > - Don't use cpu_relax.
> > - Don't use return or goto.
> > - Don't use sleeping or spinning primitives.
>
> So you're supposed to "break" out of the loop if you want to exit
> early? Or what?

Yes, break (I put that as a comment somewhere).

> One of the issues is that with a do-while/until loop, at least the way
> you've coded it, it's always done at least once.
>
> Which means that people will have to code the condition as
>
>     if (cond) {
>         .. fast case..
>         return;
>     }
>
>     spin_do {
>         ...
>     } spin_until (cond);
>
>     .. slow case ..
>
> because "cpu_relax()" itself can be slightly slow.
>
> And the way you've done it, even if there's a "break" in the loop, the
> cpu_relax() is still done (because it's done at the top).
>
> So quite frankly, I think "while(cond) ()" semantics would be better
> than "do { } while (cond)".
>
> Now, a lot of loops *are* of the kind where we've already handled the
> fast case earlier, so by the time we get into the loop we really are
> waiting for the condition to become true (but we knew it started out
> false). But not all.

Yeah that's a good point, I see what you mean. The important cases I've been looking at (e.g., mutex spinning, bit spinlock, idle loops) are the slowpath cases. But this would need to be explicit in the API so fastpaths don't use it.

> Doing a quick
>
>     git grep -2 cpu_relax
>
> for existing users of cpu_relax() does imply that most of the current
> users are very much of the "cpu_relax() at the _end_ of the loop
> tests" type.
>
> So I don't know. I think the interface sucks. It's a bit clunky.
> What is it that POWER _actually_ wants? Not the loop - the
> "cpu_relax()" kind of thing. Can we abstract *that* better?

POWER does not have an instruction like pause.
We can only set current thread priority, and current implementations do something like allocate issue cycles to threads based on relative priorities. So there should be at least one or two issue cycles at low priority, but ideally we would not be changing priority in the busy-wait loop because it can impact other threads in the core. I couldn't think of a good way to improve cpu_relax. Our (open source) firmware has a cpu_relax, and it puts a bunch of nops between low and normal priority instructions so we get some fetch cycles at low prio. That isn't ideal though. If you have any ideas, I'd be open to them. I had a secondary motive for defining it as a loop, and that was using branch hints or even asm goto to arrange the loop and exit branch nicely. E.g., if we static predict loop exits, then we don't take a branch miss as first thing when condition changes. This is a second order concern though. Thanks, Nick
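What Nick describes — changing SMT priority once around the whole wait instead of once per iteration — can be sketched like this. On real powerpc, HMT_very_low() and HMT_medium() emit priority-setting nops ("or 31,31,31" and "or 2,2,2"); they are stubbed here as counters so the structure compiles and runs on any host, and the spin_begin()/spin_end() names are only my shorthand for the proposed primitives.

```c
#include <assert.h>

/*
 * Stubs for the powerpc SMT priority macros.  The counter lets us
 * verify the priority is changed exactly twice per wait (once down,
 * once back up), never inside the loop body.
 */
static int prio_changes;

static void HMT_very_low_stub(void) { prio_changes++; }  /* drop priority  */
static void HMT_medium_stub(void)   { prio_changes++; }  /* restore it     */

static void spin_begin(void)     { HMT_very_low_stub(); }
static void spin_end(void)       { HMT_medium_stub(); }
static void spin_cpu_relax(void) { /* no priority change in the body */ }

/* Wait for *cond, bracketing the loop with one priority drop/restore. */
static int wait_with_priority(volatile int *cond)
{
	spin_begin();
	while (!*cond)
		spin_cpu_relax();
	spin_end();

	return prio_changes;
}
```

Compared with bouncing the priority every iteration, the core only sees one priority transition on entry and one on exit, which is the property the thread is after.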
Re: [PATCH] powerpc/misc: fix exported functions that reference the TOC
On Mon, 2017-04-03 at 23:29 +1000, Michael Ellerman wrote: > The other option would be just to make a rule that anything EXPORT'ed > must use _GLOBAL_TOC(). Can we enforce that somewhat at build time ? Cheers, Ben.
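Ben's build-time enforcement idea could start as a simple source scanner run over the .S files. This is purely a hedged sketch — the heuristic of pairing each EXPORT_SYMBOL(name) with a _GLOBAL_TOC(name) definition in the same source is my assumption of what such a check would look for; a real implementation would more likely be a script in scripts/ or a linker check.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if every "EXPORT_SYMBOL(name)" in src has a matching
 * "_GLOBAL_TOC(name)" definition, 0 otherwise.  Illustrative only:
 * naive substring matching, not real asm parsing.
 */
static int exports_use_global_toc(const char *src)
{
	const char *p = src;

	while ((p = strstr(p, "EXPORT_SYMBOL(")) != NULL) {
		char name[64], want[96];
		const char *start = p + strlen("EXPORT_SYMBOL(");
		const char *end = strchr(start, ')');

		if (!end || (size_t)(end - start) >= sizeof(name))
			return 0;	/* malformed or oversized name */

		memcpy(name, start, (size_t)(end - start));
		name[end - start] = '\0';

		snprintf(want, sizeof(want), "_GLOBAL_TOC(%s)", name);
		if (!strstr(src, want))
			return 0;	/* exported but not _GLOBAL_TOC */

		p = end;
	}
	return 1;
}
```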
Re: [PATCH] powerpc: Avoid taking a data miss on every userspace instruction miss
Hi Christophe, > > - if (user_mode(regs)) > > + if (!is_exec && user_mode(regs)) > > Shouldn't it also check 'is_write' ? > If it is a store, is_write should be set, shouldn't it ? Thanks, Ben had the same suggestion. I'll add that further optimisation in a subsequent patch. Anton
[PATCH 3/3] cpuidle: powernv: Avoid a branch in the core snooze_loop() loop
From: Anton Blanchard

When in the snooze_loop() we want to take up the least amount of resources. On my version of gcc (6.3), we end up with an extra branch because it predicts snooze_timeout_en to be false, whereas it is almost always true.

Use likely() to avoid the branch and be a little nicer to the other non idle threads on the core.

Signed-off-by: Anton Blanchard
---
 drivers/cpuidle/cpuidle-powernv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 8c991c254b95..251a60bfa8ee 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -58,7 +58,7 @@ static int snooze_loop(struct cpuidle_device *dev,
 	ppc64_runlatch_off();
 	HMT_very_low();
 	while (!need_resched()) {
-		if (snooze_timeout_en && get_tb() > snooze_exit_time)
+		if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time)
 			break;
 	}
--
2.11.0
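For readers unfamiliar with the annotation: the kernel's likely()/unlikely() macros boil down to __builtin_expect, which only influences the compiler's branch prediction hints and block layout — behavior is unchanged. A minimal sketch of the macro and the branch shape from the patch above:

```c
#include <assert.h>

/* The kernel's likely()/unlikely() reduce to __builtin_expect. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Same shape as the snooze_loop() exit test: the timeout-enabled flag
 * is almost always true, so telling the compiler so keeps the common
 * path as the straight-line fall-through.
 */
static int should_exit(int timeout_en, unsigned long now, unsigned long deadline)
{
	if (likely(timeout_en) && now > deadline)
		return 1;
	return 0;
}
```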
[PATCH 2/3] cpuidle: powernv: Don't continually set thread priority in snooze_loop()
From: Anton Blanchard

The powerpc64 kernel exception handlers have preserved thread priorities for a long time now, so there is no need to continually set it. Just set it once on entry and once exit.

Signed-off-by: Anton Blanchard
---
 drivers/cpuidle/cpuidle-powernv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 9d9f164894eb..8c991c254b95 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -56,8 +56,8 @@ static int snooze_loop(struct cpuidle_device *dev,
 	snooze_exit_time = get_tb() + snooze_timeout;
 	ppc64_runlatch_off();
+	HMT_very_low();
 	while (!need_resched()) {
-		HMT_very_low();
 		if (snooze_timeout_en && get_tb() > snooze_exit_time)
 			break;
 	}
--
2.11.0
[PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority
From: Anton Blanchard

The core of snooze_loop() continually bounces between low and very low thread priority. Changing thread priorities is an expensive operation that can negatively impact other threads on a core.

All CPUs that can run PowerNV support very low priority, so we can avoid the change completely.

Signed-off-by: Anton Blanchard
---
 drivers/cpuidle/cpuidle-powernv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index cda8f62d555b..9d9f164894eb 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -57,7 +57,6 @@ static int snooze_loop(struct cpuidle_device *dev,
 	snooze_exit_time = get_tb() + snooze_timeout;
 	ppc64_runlatch_off();
 	while (!need_resched()) {
-		HMT_low();
 		HMT_very_low();
 		if (snooze_timeout_en && get_tb() > snooze_exit_time)
 			break;
 	}
--
2.11.0
Re: [PATCH] raid6/altivec: adding vpermxor implementation for raid6 Q syndrome
> In that function, the flow is: > pagefault_disable(); > enable_kernel_altivec(); > > pagefault_enable(); > > There are a few things that it would be nice (but by no means essential) > to find out: > - what is the difference between pagefault and prempt enable/disable > - is it required to disable altivec after the end of the function or >can we leave that enabled? Answering my own question here, dc4fbba11e46 ("powerpc: Create disable_kernel_{fp,altivec,vsx,spe}()") adds the disable_ function, and it's a no-op except under debug conditions. So it should stay. Regards, Daniel > >> + >> +int raid6_have_altivec_vpermxor(void); >> +#if $# == 1 >> +int raid6_have_altivec_vpermxor(void) >> +{ >> +/* Check if CPU has both altivec and the vpermxor instruction*/ > Please add a space: s|ion*/|ion */| >> +# ifdef __KERNEL__ >> +return (cpu_has_feature(CONFIG_ALTIVEC) && >> +cpu_has_feature(CPU_FTR_ARCH_207S)); > I assume this is future-proof - an ISA 3.00 cpu will advertise 2.07S > compat? > >> +# else >> +return 1; >> +#endif >> + >> +} >> +#endif >> + >> +const struct raid6_calls raid6_vpermxor$# = { >> +raid6_vpermxor$#_gen_syndrome, >> +NULL, >> +raid6_have_altivec_vpermxor, >> +"vpermxor$#", >> +0 >> +}; >> +#endif >> -- >> 2.9.3 > > Apart from that this patch looks good and I expect I will be able to > formally Review v2. > > Regards, > Daniel
Re: [RFC PATCH 4/5] dt-bindings: soc/fsl: Update reserved memory binding for QBMan
On 4/3/2017 11:42 AM, Rob Herring wrote: > On Wed, Mar 29, 2017 at 05:13:56PM -0400, Roy Pledge wrote: >> Updates the QMan and BMan device tree bindings for reserved memory >> nodes. This makes the reserved memory allocation compatiable with > s/compatiable/compatible/ > >> the shared-dma-pool usage. > This change is not backwards compatible. Please state that and explain > why that is okay. If PPC needs to not change, then the old strings and > properties should remain, but deprecated. I think I can make the old device trees compatible since the "compatible" string changed without too much effort or ifdefery in the code. However I would like to eventually see all PPC users move to the new mode as I do believe it is more in alignment with the spirit of the reserved-memory framework that is in the kernel. Do I need to mention the old mode in the binding.txt file? I'm trying to keep things clean so someone reading the binding doesn't get a headache and trying to preserve my own sanity as we move forward with adding new features to this driver. Roy
Re: [PATCH] powerpc: Avoid taking a data miss on every userspace instruction miss
Anton Blanchard wrote:

From: Anton Blanchard

Early on in do_page_fault() we call store_updates_sp(), regardless of the type of exception. For an instruction miss this doesn't make sense, because we only use this information to detect if a data miss is the result of a stack expansion instruction or not.

Worse still, it results in a data miss within every userspace instruction miss handler, because we try and load the very instruction we are about to install a pte for!

A simple exec microbenchmark runs 6% faster on POWER8 with this fix:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	unsigned long left = atol(argv[1]);
	char leftstr[16];

	if (left-- == 0)
		return 0;

	sprintf(leftstr, "%ld", left);
	execlp(argv[0], argv[0], leftstr, NULL);
	perror("exec failed\n");

	return 0;
}

Pass the number of iterations on the command line (eg 1) and time how long it takes to execute.

Signed-off-by: Anton Blanchard
---
 arch/powerpc/mm/fault.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index fd6484fc2fa9..3a7d580fdc59 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -287,7 +287,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	 * can result in fault, which will cause a deadlock when called with
 	 * mmap_sem held
 	 */
-	if (user_mode(regs))
+	if (!is_exec && user_mode(regs))

Shouldn't it also check 'is_write' ?
If it is a store, is_write should be set, shouldn't it ?

Christophe

 		store_update_sp = store_updates_sp(regs);

 	if (user_mode(regs))
--
2.11.0
Re: [v10, 7/7] mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0
Sorry to post in this huge email bunch. I have most probably hit an
erratum in the Freescale T4240 for PPC_DISABLE_THREADS. I'm using a
Rev2 T4240. Does this erratum need to be taken care of here, or not?
Any quick help is appreciated!

My issue: I'm running line-rate traffic to the T4240 [10G of traffic on
each port, each with a capacity of 10G]. After a few hours of running
traffic, my CPU/LMP gets stuck and goes in for a hard reset. When I
searched through the code and some open forums, I saw the erratum
listed above. I'm not sure if that erratum applies to me or not. I have
pasted a snapshot of the issue occurring on my system below.

During the hang, with soft-lockup detection enabled, I get prints from
smp_many showing "every processor is waiting for processor 22"; I need
to know what happened to processor 22. The processor number keeps
changing every time I run the traffic. I feel the issue is that
processor 22 either goes into a deadlock state with interrupts
disabled, or went into a deep idle sleep state. If it's a deep idle
state, an NMI should have recovered it. But if it's a deadlock with
interrupts disabled, then I need to know the root cause of the
deadlock.
root@A@0-1-1:~#
[ 1602.261011] Current::17 waiting::22 flag::4353
[ 1602.720488] Current::14 waiting::22 flag::3585
[ 1602.756404] Current::15 waiting::22 flag::3841
[ 1604.903248] Current::16 waiting::22 flag::4097
[ 1619.400917] Current::14 waiting::22 flag::3585
[ 1619.517870] Current::17 waiting::22 flag::4353
[ 1619.749893] Current::15 waiting::22 flag::3841
[ 1619.798453] Current::4 waiting::22 flag::1025
[ 1622.177171] Current::16 waiting::22 flag::4097
[ 1622.449412] INFO: rcu_preempt detected stalls on CPUs/tasks: { 22} (detected by 4, t=21008 jiffies, g=101651, c=101650, q=4713)
[ 1622.460951] Task dump for CPU 22:
[ 1622.464275] swapper/22 R running task 0 0 1 0x0800
[ 1622.471355] Call Trace:
[ 1622.473820] [c001f92b78d0] [009e] 0x9e (unreliable)
[ 1622.480119] [c001f92b7960] [c001f92b7aa0] 0xc001f92b7aa0
[ 1622.486497] [c001f92b79e0] [c0a90275] 0xc0a90275
[ 1622.492878] [c001f92b7a60] [c0006f64] .do_IRQ+0x184/0x370
[ 1622.499344] [c001f92b7b10] [c001b93c] exc_0x500_common+0xfc/0x100
[ 1622.506539] --- Exception: 501 at 0xc0a3e200
[ 1622.506539] LR = .__check_irq_replay+0x68/0x110
[ 1622.516388] [c001f92b7e00] [c00bc590] .cpu_startup_entry+0x1d0/0x350 (unreliable)
[ 1622.524950] [c001f92b7ed0] [c0a00460] .start_secondary+0x3ec/0x3f4
[ 1622.532197] [c001f92b7f90] [c36c] .start_secondary_prolog+0x10/0x14
[ 1635.237829] Current::3 waiting::5 flag::769
[ 1636.587746] Current::14 waiting::22 flag::3585
[ 1637.082117] Current::17 waiting::22 flag::4353
[ 1637.224920] Current::4 waiting::22 flag::1025
[ 1637.268789] Current::15 waiting::22 flag::3841
[ 1640.085792] Current::16 waiting::22 flag::4097
[ 1651.093367] Current::4 waiting::22 flag::1025

Regards,
Kiran

On Thu, May 5, 2016 at 4:40 PM, Arnd Bergmann wrote:
> On Thursday 05 May 2016 09:41:32 Yangbo Lu wrote:
> > > -----Original Message-----
> > > From: Arnd Bergmann [mailto:a...@arndb.de]
> > > Sent: Thursday, May 05, 2016 4:32 PM
> > > To: linuxppc-dev@lists.ozlabs.org
> > > Cc: Yangbo Lu;
> > > linux-...@vger.kernel.org; devicet...@vger.kernel.org;
> > > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> > > linux-...@vger.kernel.org; linux-...@vger.kernel.org;
> > > iommu@lists.linux-foundation.org; net...@vger.kernel.org;
> > > Mark Rutland; ulf.hans...@linaro.org; Russell King; Bhupesh Sharma;
> > > Joerg Roedel; Santosh Shilimkar; Yang-Leo Li; Scott Wood;
> > > Rob Herring; Claudiu Manoil; Kumar Gala; Xiaobo Xie; Qiang Zhao
> > > Subject: Re: [v10, 7/7] mmc: sdhci-of-esdhc: fix host version for
> > > T4240-R1.0-R2.0
> > >
> > > On Thursday 05 May 2016 11:12:30 Yangbo Lu wrote:
> > > >
> > > > +	fsl_guts_init();
> > > > +	svr = fsl_guts_get_svr();
> > > > +	if (svr) {
> > > > +		esdhc->soc_ver = SVR_SOC_VER(svr);
> > > > +		esdhc->soc_rev = SVR_REV(svr);
> > > > +	} else {
> > > > +		dev_err(&pdev->dev, "Failed to get SVR value!\n");
> > > > +	}
> > > > +
> > >
> > > Sorry for jumping in again after not participating in the discussion
> > > for the past few versions.
> > >
> > > What happened to my suggestion of making this a platform-independent
> > > interface to avoid the link time dependency?
> > >
> > > Specifically, why not add an exported function to drivers/base/soc.c
> > > that uses glob_match() for comparing a string in the device driver to
> > > the ID of the SoC that is set by whatever SoC identifying driver the
> > > platform has?
> >
> > [Lu Yangbo-B47093] I think this has been discussed in v6.
> > You can find Scott's c
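For reference, the kind of platform-independent helper Arnd describes could look roughly like the sketch below, in userspace terms: the platform's SoC identification driver registers an ID string, and device drivers compare against it with a glob pattern instead of linking directly to platform code. POSIX fnmatch() stands in for the kernel's glob_match(), and all names here are hypothetical:

```c
#include <fnmatch.h>
#include <stdbool.h>

/*
 * Hypothetical SoC ID registered by the platform's identification
 * driver. In the kernel this would live behind drivers/base/soc.c
 * rather than being a plain global.
 */
const char *registered_soc_id = "T4240E-rev2.0";

/*
 * Driver-facing match helper: compare a glob pattern supplied by the
 * device driver against whatever ID the platform registered, so the
 * driver has no link-time dependency on platform code.
 */
bool soc_id_matches(const char *pattern)
{
	return fnmatch(pattern, registered_soc_id, 0) == 0;
}
```

A driver could then write `soc_id_matches("T4240*")` to cover all revisions of one SoC without pulling in fsl_guts symbols directly.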
Re: [RFC PATCH 4/5] dt-bindings: soc/fsl: Update reserved memory binding for QBMan
On Wed, Mar 29, 2017 at 05:13:56PM -0400, Roy Pledge wrote:
> Updates the QMan and BMan device tree bindings for reserved memory
> nodes. This makes the reserved memory allocation compatiable with

s/compatiable/compatible/

> the shared-dma-pool usage.

This change is not backwards compatible. Please state that and explain
why that is okay. If PPC needs to not change, then the old strings and
properties should remain, but deprecated.

>
> Signed-off-by: Roy Pledge
> ---
>  Documentation/devicetree/bindings/soc/fsl/bman.txt | 10 +-
>  Documentation/devicetree/bindings/soc/fsl/qman.txt | 16 +---
>  2 files changed, 14 insertions(+), 12 deletions(-)
[PATCH v6 11/11] powerpc/perf: Thread imc cpuhotplug support
From: Anju T Sudhakar

This patch adds support for thread IMC on cpuhotplug. When a cpu goes
offline, the LDBAR for that cpu is disabled, and when it comes back
online the previous LDBAR value is written back to the LDBAR for that
cpu.

To register the hotplug functions for thread_imc, a new state
CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE is added to the list of existing
states.

Reviewed-by: Gautham R. Shenoy
Signed-off-by: Anju T Sudhakar
Signed-off-by: Madhavan Srinivasan
---
 arch/powerpc/perf/imc-pmu.c | 32 +++-
 include/linux/cpuhotplug.h  |  1 +
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 3db637c6d3f4..944e568b52ff 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -726,6 +726,16 @@ static void cleanup_all_thread_imc_memory(void)
 	on_each_cpu(cleanup_thread_imc_memory, NULL, 1);
 }
 
+static void thread_imc_update_ldbar(unsigned int cpu_id)
+{
+	u64 ldbar_addr, ldbar_value;
+
+	ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);
+	ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
+		(u64)THREAD_IMC_ENABLE;
+	mtspr(SPRN_LDBAR, ldbar_value);
+}
+
 /*
  * Allocates a page of memory for each of the online cpus, and, writes the
  * physical base address of that page to the LDBAR for that cpu. This starts
@@ -733,21 +743,33 @@ static void cleanup_all_thread_imc_memory(void)
  */
 static void thread_imc_mem_alloc(void *dummy)
 {
-	u64 ldbar_addr, ldbar_value;
 	int cpu_id = smp_processor_id();
 	int phys_id = topology_physical_package_id(smp_processor_id());
 
 	per_cpu_add[cpu_id] = (u64)alloc_pages_exact_nid(phys_id,
 			(size_t)IMC_THREAD_COUNTER_MEM,
 			GFP_KERNEL | __GFP_ZERO);
-	ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);
-	ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
-		(u64)THREAD_IMC_ENABLE;
-	mtspr(SPRN_LDBAR, ldbar_value);
+	thread_imc_update_ldbar(cpu_id);
+}
+
+static int ppc_thread_imc_cpu_online(unsigned int cpu)
+{
+	thread_imc_update_ldbar(cpu);
+	return 0;
+}
+
+static int ppc_thread_imc_cpu_offline(unsigned int cpu)
+{
+	mtspr(SPRN_LDBAR, 0);
+	return 0;
+}
+
 void thread_imc_cpu_init(void)
 {
 	on_each_cpu(thread_imc_mem_alloc, NULL, 1);
+	cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE,
+			  "POWER_THREAD_IMC_ONLINE",
+			  ppc_thread_imc_cpu_online,
+			  ppc_thread_imc_cpu_offline);
 }
 
 /*
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index abde85d9511a..724df46b2c3c 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -138,6 +138,7 @@ enum cpuhp_state {
 	CPUHP_AP_PERF_ARM_L2X0_ONLINE,
 	CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
 	CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE,
+	CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE,
 	CPUHP_AP_PERF_ARM_QCOM_L2_ONLINE,
 	CPUHP_AP_WORKQUEUE_ONLINE,
 	CPUHP_AP_RCUTREE_ONLINE,
--
2.7.4
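The offline/online scheme in this patch (zero LDBAR on offline, rewrite the saved per-cpu value on online) can be modeled with a small simulation. The arrays and the enable-bit value below are illustrative stand-ins, not the real kernel objects or the real LDBAR layout:

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8

/*
 * Toy model of the LDBAR handling described above. per_cpu_ldbar[]
 * stands in for the saved per-cpu value (per_cpu_add plus the enable
 * bit in the real patch); spr_ldbar[] stands in for the SPR itself.
 * The enable bit chosen here is illustrative only.
 */
#define LDBAR_ENABLE (1ULL << 63)

uint64_t per_cpu_ldbar[NR_CPUS];
uint64_t spr_ldbar[NR_CPUS];

/* A cpu going offline stops its thread IMC counters. */
void cpu_offline(unsigned int cpu)
{
	spr_ldbar[cpu] = 0;
}

/* A cpu coming back online restores its previously saved value. */
void cpu_online(unsigned int cpu)
{
	spr_ldbar[cpu] = per_cpu_ldbar[cpu];
}
```

The point of the refactor in the patch is that both the allocation path and the hotplug-online path funnel through one helper (thread_imc_update_ldbar), so the restore logic cannot drift from the initial setup.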
[PATCH v6 10/11] powerpc/perf: Thread IMC PMU functions
From: Hemant Kumar

This patch adds the PMU functions required for event initialization,
read, update, add, del etc. for the thread IMC PMU. Thread IMC PMUs are
used for per-task monitoring. These PMUs don't need any hotplugging
support.

For each CPU, a page of memory is allocated and is kept static, i.e.
these pages will exist till the machine shuts down. The base address of
this page is assigned to the LDBAR of that cpu. As soon as we do that,
the thread IMC counters start running for that cpu and their data is
written to the allocated page. But we use this for per-task monitoring.
Whenever we start monitoring a task, the event is added onto the task.
At that point, we read the initial value of the event. Whenever we stop
monitoring the task, the final value is taken and the difference is the
event data.

Now, a task can move to a different cpu. Suppose a task X is moving
from cpu A to cpu B. When the task is scheduled out of A, we get an
event_del for A, and hence the event data is updated, and we stop
updating X's event data. As soon as X moves on to B, event_add is
called for B, and we again update the event data. This is how the event
data keeps being updated even as the task is scheduled across different
cpus.
Signed-off-by: Anju T Sudhakar Signed-off-by: Hemant Kumar Signed-off-by: Madhavan Srinivasan --- arch/powerpc/include/asm/imc-pmu.h| 5 + arch/powerpc/perf/imc-pmu.c | 173 +- arch/powerpc/platforms/powernv/opal-imc.c | 3 + 3 files changed, 180 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index c63bc78fd6f6..42f0149886b4 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -25,12 +25,16 @@ #define IMC_NEST_MAX_PAGES 16 #define IMC_CORE_COUNTER_MEM 8192 +#define IMC_THREAD_COUNTER_MEM 8192 #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters" #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest" #define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core" #define IMC_DTB_THREAD_COMPAT "ibm,imc-counters-thread" +#define THREAD_IMC_LDBAR_MASK 0x0003e000 +#define THREAD_IMC_ENABLE 0x8000 + /* * Structure to hold per chip specific memory address * information for nest pmus. Nest Counter data are exported @@ -73,4 +77,5 @@ struct imc_pmu { int imc_get_domain(struct device_node *pmu_dev); void core_imc_disable(void); +void thread_imc_disable(void); #endif /* PPC_POWERNV_IMC_PMU_DEF_H */ diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index 35b3564747e2..3db637c6d3f4 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -30,6 +30,9 @@ static u64 per_core_pdbar_add[IMC_MAX_CHIPS][IMC_MAX_CORES]; static cpumask_t core_imc_cpumask; struct imc_pmu *core_imc_pmu; +/* Maintains base address for all the cpus */ +static u64 per_cpu_add[NR_CPUS]; + /* Needed for sanity check */ extern u64 nest_max_offset; extern u64 core_max_offset; @@ -461,6 +464,56 @@ static int core_imc_event_init(struct perf_event *event) return 0; } +static int thread_imc_event_init(struct perf_event *event) +{ + struct task_struct *target; + + if (event->attr.type != event->pmu->type) + return -ENOENT; + + /* Sampling not supported */ + if (event->hw.sample_period) + 
return -EINVAL; + + event->hw.idx = -1; + + /* Sanity check for config (event offset) */ + if (event->attr.config > thread_max_offset) + return -EINVAL; + + target = event->hw.target; + + if (!target) + return -EINVAL; + + event->pmu->task_ctx_nr = perf_sw_context; + return 0; +} + +static void thread_imc_read_counter(struct perf_event *event) +{ + u64 *addr, data; + int cpu_id = smp_processor_id(); + + addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config); + data = __be64_to_cpu(READ_ONCE(*addr)); + local64_set(&event->hw.prev_count, data); +} + +static void thread_imc_perf_event_update(struct perf_event *event) +{ + u64 counter_prev, counter_new, final_count, *addr; + int cpu_id = smp_processor_id(); + + addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config); + counter_prev = local64_read(&event->hw.prev_count); + counter_new = __be64_to_cpu(READ_ONCE(*addr)); + final_count = counter_new - counter_prev; + + local64_set(&event->hw.prev_count, counter_new); + local64_add(final_count, &event->count); +} + static void imc_read_counter(struct perf_event *event) { u64 *addr, data; @@ -511,6 +564,53 @@ static int imc_event_add(struct perf_event *event, int flags) return 0; } +static void thread_imc_event_start(struct perf_event *event, int flags) +{ + thread_i
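The scheduling-in/scheduling-out bookkeeping described in the commit message above can be sketched as a toy simulation (made-up names, not the kernel code): on add we snapshot the counter of the cpu the task lands on, and on del we fold the delta into the accumulated count, so the total survives migration between cpus.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4

/* Free-running per-cpu counters (stand-in for the per-cpu IMC pages). */
uint64_t cpu_counter[NR_CPUS];

struct event {
	uint64_t prev_count;	/* snapshot taken at event_add time */
	uint64_t count;		/* accumulated event data */
};

/* Task scheduled in on 'cpu': snapshot that cpu's counter (event_add). */
void event_add(struct event *ev, int cpu)
{
	ev->prev_count = cpu_counter[cpu];
}

/* Task scheduled out of 'cpu': fold in the delta (event_del). */
void event_del(struct event *ev, int cpu)
{
	ev->count += cpu_counter[cpu] - ev->prev_count;
}
```

Because only deltas between matching add/del pairs are accumulated, the running totals of the two cpus' free-running counters never leak into the task's count.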
[PATCH v6 09/11] powerpc/powernv: Thread IMC events detection
From: Hemant Kumar Patch adds support for detection of thread IMC events. It adds a new domain IMC_DOMAIN_THREAD and it is determined with the help of the compatibility string "ibm,imc-counters-thread" based on the IMC device tree. Signed-off-by: Anju T Sudhakar Signed-off-by: Hemant Kumar Signed-off-by: Madhavan Srinivasan --- arch/powerpc/include/asm/imc-pmu.h| 2 ++ arch/powerpc/perf/imc-pmu.c | 1 + arch/powerpc/platforms/powernv/opal-imc.c | 9 - 3 files changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index 4aa63191456a..c63bc78fd6f6 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -29,6 +29,7 @@ #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters" #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest" #define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core" +#define IMC_DTB_THREAD_COMPAT "ibm,imc-counters-thread" /* * Structure to hold per chip specific memory address @@ -67,6 +68,7 @@ struct imc_pmu { */ #define IMC_DOMAIN_NEST1 #define IMC_DOMAIN_CORE2 +#define IMC_DOMAIN_THREAD 3 #define IMC_DOMAIN_UNKNOWN -1 int imc_get_domain(struct device_node *pmu_dev); diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index 45f9b35142a7..35b3564747e2 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -33,6 +33,7 @@ struct imc_pmu *core_imc_pmu; /* Needed for sanity check */ extern u64 nest_max_offset; extern u64 core_max_offset; +extern u64 thread_max_offset; PMU_FORMAT_ATTR(event, "config:0-20"); static struct attribute *imc_format_attrs[] = { diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index f261fc933959..ac625cf13875 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -40,6 +40,7 @@ extern int init_imc_pmu(struct imc_events *events, int idx, struct imc_pmu *pmu_ptr); u64 nest_max_offset; u64 
core_max_offset; +u64 thread_max_offset; static int imc_event_info(char *name, struct imc_events *events) { @@ -87,6 +88,10 @@ static void update_max_value(u32 value, int pmu_domain) if (core_max_offset < value) core_max_offset = value; break; + case IMC_DOMAIN_THREAD: + if (thread_max_offset < value) + thread_max_offset = value; + break; default: /* Unknown domain, return */ return; @@ -240,6 +245,8 @@ int imc_get_domain(struct device_node *pmu_dev) return IMC_DOMAIN_NEST; if (of_device_is_compatible(pmu_dev, IMC_DTB_CORE_COMPAT)) return IMC_DOMAIN_CORE; + if (of_device_is_compatible(pmu_dev, IMC_DTB_THREAD_COMPAT)) + return IMC_DOMAIN_THREAD; else return IMC_DOMAIN_UNKNOWN; } @@ -278,7 +285,7 @@ static void imc_free_events(struct imc_events *events, int nr_entries) /* * imc_pmu_create : Takes the parent device which is the pmu unit and a * pmu_index as the inputs. - * Allocates memory for the pmu, sets up its domain (NEST or CORE), and + * Allocates memory for the pmu, sets up its domain (NEST/CORE/THREAD), and * allocates memory for the events supported by this pmu. Assigns a name for * the pmu. Calls imc_events_node_parser() to setup the individual events. * If everything goes fine, it calls, init_imc_pmu() to setup the pmu device -- 2.7.4
[PATCH v6 07/11] powerpc/powernv: Core IMC events detection
From: Hemant Kumar This patch adds support for detection of core IMC events along with the Nest IMC events. It adds a new domain IMC_DOMAIN_CORE and its determined with the help of the compatibility string "ibm,imc-counters-core" based on the IMC device tree. Signed-off-by: Anju T Sudhakar Signed-off-by: Hemant Kumar Signed-off-by: Madhavan Srinivasan --- arch/powerpc/include/asm/imc-pmu.h| 2 ++ arch/powerpc/perf/imc-pmu.c | 3 +++ arch/powerpc/platforms/powernv/opal-imc.c | 18 -- 3 files changed, 21 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index e13f51047522..2c39603ff3e7 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -26,6 +26,7 @@ #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters" #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest" +#define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core" /* * Structure to hold per chip specific memory address @@ -63,6 +64,7 @@ struct imc_pmu { * Domains for IMC PMUs */ #define IMC_DOMAIN_NEST1 +#define IMC_DOMAIN_CORE2 #define IMC_DOMAIN_UNKNOWN -1 int imc_get_domain(struct device_node *pmu_dev); diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index b28835b861b3..728c67e139e0 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -19,8 +19,11 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; static cpumask_t nest_imc_cpumask; +struct imc_pmu *core_imc_pmu; + /* Needed for sanity check */ extern u64 nest_max_offset; +extern u64 core_max_offset; PMU_FORMAT_ATTR(event, "config:0-20"); static struct attribute *imc_format_attrs[] = { diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index a98678266b0d..f6f63399ab06 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -34,10 +34,12 @@ extern struct perchip_nest_info 
nest_perchip_info[IMC_MAX_CHIPS]; extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; +extern struct imc_pmu *core_imc_pmu; extern int init_imc_pmu(struct imc_events *events, int idx, struct imc_pmu *pmu_ptr); u64 nest_max_offset; +u64 core_max_offset; static int imc_event_info(char *name, struct imc_events *events) { @@ -81,6 +83,10 @@ static void update_max_value(u32 value, int pmu_domain) if (nest_max_offset < value) nest_max_offset = value; break; + case IMC_DOMAIN_CORE: + if (core_max_offset < value) + core_max_offset = value; + break; default: /* Unknown domain, return */ return; @@ -232,6 +238,8 @@ int imc_get_domain(struct device_node *pmu_dev) { if (of_device_is_compatible(pmu_dev, IMC_DTB_NEST_COMPAT)) return IMC_DOMAIN_NEST; + if (of_device_is_compatible(pmu_dev, IMC_DTB_CORE_COMPAT)) + return IMC_DOMAIN_CORE; else return IMC_DOMAIN_UNKNOWN; } @@ -299,7 +307,10 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index) goto free_pmu; /* Needed for hotplug/migration */ - per_nest_pmu_arr[pmu_index] = pmu_ptr; + if (pmu_ptr->domain == IMC_DOMAIN_CORE) + core_imc_pmu = pmu_ptr; + else if (pmu_ptr->domain == IMC_DOMAIN_NEST) + per_nest_pmu_arr[pmu_index] = pmu_ptr; /* * "events" property inside a PMU node contains the phandle value @@ -355,7 +366,10 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index) } /* Save the name to register it later */ - sprintf(buf, "nest_%s", (char *)pp->value); + if (pmu_ptr->domain == IMC_DOMAIN_NEST) + sprintf(buf, "nest_%s", (char *)pp->value); + else + sprintf(buf, "%s_imc", (char *)pp->value); pmu_ptr->pmu.name = (char *)buf; /* -- 2.7.4
[PATCH v6 06/11] powerpc/perf: IMC pmu cpumask and cpu hotplug support
From: Hemant Kumar Adds cpumask attribute to be used by each IMC pmu. Only one cpu (any online CPU) from each chip for nest PMUs is designated to read counters. On CPU hotplug, dying CPU is checked to see whether it is one of the designated cpus, if yes, next online cpu from the same chip (for nest units) is designated as new cpu to read counters. For this purpose, we introduce a new state : CPUHP_AP_PERF_POWERPC_NEST_ONLINE. Signed-off-by: Anju T Sudhakar Signed-off-by: Hemant Kumar Signed-off-by: Madhavan Srinivasan --- arch/powerpc/include/asm/opal-api.h| 13 ++- arch/powerpc/include/asm/opal.h| 3 + arch/powerpc/perf/imc-pmu.c| 155 - arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + include/linux/cpuhotplug.h | 1 + 5 files changed, 171 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index a0aa285869b5..23fc51e9d71d 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -168,7 +168,8 @@ #define OPAL_INT_SET_MFRR 125 #define OPAL_PCI_TCE_KILL 126 #define OPAL_NMMU_SET_PTCR 127 -#define OPAL_LAST 127 +#define OPAL_NEST_IMC_COUNTERS_CONTROL 149 +#define OPAL_LAST 149 /* Device tree flags */ @@ -928,6 +929,16 @@ enum { OPAL_PCI_TCE_KILL_ALL, }; +/* Operation argument to OPAL_NEST_IMC_COUNTERS_CONTROL */ +enum { + OPAL_NEST_IMC_PRODUCTION_MODE = 1, +}; + +enum { + OPAL_NEST_IMC_STOP, + OPAL_NEST_IMC_START, +}; + #endif /* __ASSEMBLY__ */ #endif /* __OPAL_API_H */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 1ff03a6da76e..d93d08204243 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -227,6 +227,9 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t kill_type, uint64_t dma_addr, uint32_t npages); int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr); +int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1, + uint64_t value2, uint64_t value3); + /* Internal 
functions */ extern int early_init_dt_scan_opal(unsigned long node, const char *uname, int depth, void *data); diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index bd5bf66bd920..b28835b861b3 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -17,6 +17,7 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; +static cpumask_t nest_imc_cpumask; /* Needed for sanity check */ extern u64 nest_max_offset; @@ -32,6 +33,152 @@ static struct attribute_group imc_format_group = { .attrs = imc_format_attrs, }; +/* Get the cpumask printed to a buffer "buf" */ +static ssize_t imc_pmu_cpumask_get_attr(struct device *dev, + struct device_attribute *attr, char *buf) +{ + cpumask_t *active_mask; + + active_mask = &nest_imc_cpumask; + return cpumap_print_to_pagebuf(true, buf, active_mask); +} + +static DEVICE_ATTR(cpumask, S_IRUGO, imc_pmu_cpumask_get_attr, NULL); + +static struct attribute *imc_pmu_cpumask_attrs[] = { + &dev_attr_cpumask.attr, + NULL, +}; + +static struct attribute_group imc_pmu_cpumask_attr_group = { + .attrs = imc_pmu_cpumask_attrs, +}; + +/* + * nest_init : Initializes the nest imc engine for the current chip. 
+ */ +static void nest_init(int *loc) +{ + int rc; + + rc = opal_nest_imc_counters_control(OPAL_NEST_IMC_PRODUCTION_MODE, + OPAL_NEST_IMC_START, 0, 0); + if (rc) + loc[smp_processor_id()] = 1; +} + +static void nest_change_cpu_context(int old_cpu, int new_cpu) +{ + int i; + + for (i = 0; +(per_nest_pmu_arr[i] != NULL) && (i < IMC_MAX_PMUS); i++) + perf_pmu_migrate_context(&per_nest_pmu_arr[i]->pmu, + old_cpu, new_cpu); +} + +static int ppc_nest_imc_cpu_online(unsigned int cpu) +{ + int nid; + const struct cpumask *l_cpumask; + struct cpumask tmp_mask; + + /* Find the cpumask of this node */ + nid = cpu_to_node(cpu); + l_cpumask = cpumask_of_node(nid); + + /* +* If any of the cpu from this node is already present in the mask, +* just return, if not, then set this cpu in the mask. +*/ + if (!cpumask_and(&tmp_mask, l_cpumask, &nest_imc_cpumask)) { + cpumask_set_cpu(cpu, &nest_imc_cpumask); + return 0; + } + + return 0; +} + +static int ppc_nest_imc_cpu_offline(unsigned int cpu) +{ + int nid, target = -1; +
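The designation rule described in the commit message (keep exactly one online cpu per chip/node in the nest cpumask) can be modeled with a small simulation; the topology helper and array names here are made up, with a flat array standing in for the kernel's cpumask:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Toy topology: 4 cpus per node, mirroring cpumask_of_node(). */
int cpu_to_node(int cpu)
{
	return cpu / 4;
}

/* Stand-in for nest_imc_cpumask. */
bool nest_mask[NR_CPUS];

/*
 * On cpu online: designate this cpu as its node's counter reader only
 * if the node has no designated reader yet (the cpumask_and() check in
 * ppc_nest_imc_cpu_online above).
 */
void nest_cpu_online(int cpu)
{
	for (int i = 0; i < NR_CPUS; i++)
		if (nest_mask[i] && cpu_to_node(i) == cpu_to_node(cpu))
			return;
	nest_mask[cpu] = true;
}
```

The offline path is the mirror image: if the dying cpu is the designated one, clear it and pick another online cpu from the same node, migrating the perf context to it.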
[PATCH v6 08/11] powerpc/perf: PMU functions for Core IMC and hotplugging
From: Hemant Kumar

This patch adds the PMU function to initialize a core IMC event. It
also adds a cpumask initialization function for the core IMC PMU. For
initialization, 8KB of memory is allocated per core, where the data for
core IMC counters will be accumulated. The base address of this page is
sent to OPAL via an OPAL call, which initializes various SCOMs related
to core IMC initialization. Upon any error, the pages are freed and
core IMC counters are disabled using the same OPAL call.

For CPU hotplugging, a cpumask is initialized which contains an online
CPU from each core. If a cpu goes offline, we check whether that cpu
belongs to the core imc cpumask; if yes, we migrate the PMU context to
any other online cpu (if available) in that core. If a cpu comes back
online, it is added to the core imc cpumask only if there was no other
cpu from that core in the previous cpumask.

To register the hotplug functions for core_imc, a new state
CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE is added to the list of existing
states.

The patch also adds an OPAL device shutdown callback, needed to disable
the IMC core engine to handle kexec.
Signed-off-by: Hemant Kumar [Anju: Changed the condition for setting cpumask for core in imc_pmu_cpumask_get_attr() ] Signed-off-by: Anju T Sudhakar Signed-off-by: Madhavan Srinivasan --- arch/powerpc/include/asm/imc-pmu.h | 3 + arch/powerpc/include/asm/opal-api.h| 10 +- arch/powerpc/include/asm/opal.h| 2 + arch/powerpc/perf/imc-pmu.c| 267 - arch/powerpc/platforms/powernv/opal-imc.c | 13 +- arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + include/linux/cpuhotplug.h | 1 + 7 files changed, 287 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index 2c39603ff3e7..4aa63191456a 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -21,8 +21,10 @@ #define IMC_MAX_CHIPS 32 #define IMC_MAX_PMUS 32 #define IMC_MAX_PMU_NAME_LEN 256 +#define IMC_MAX_CORES 32 #define IMC_NEST_MAX_PAGES 16 +#define IMC_CORE_COUNTER_MEM 8192 #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters" #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest" @@ -68,4 +70,5 @@ struct imc_pmu { #define IMC_DOMAIN_UNKNOWN -1 int imc_get_domain(struct device_node *pmu_dev); +void core_imc_disable(void); #endif /* PPC_POWERNV_IMC_PMU_DEF_H */ diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 23fc51e9d71d..971918deb793 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -169,7 +169,8 @@ #define OPAL_PCI_TCE_KILL 126 #define OPAL_NMMU_SET_PTCR 127 #define OPAL_NEST_IMC_COUNTERS_CONTROL 149 -#define OPAL_LAST 149 +#define OPAL_CORE_IMC_COUNTERS_CONTROL 150 +#define OPAL_LAST 150 /* Device tree flags */ @@ -939,6 +940,13 @@ enum { OPAL_NEST_IMC_START, }; +/* Operation argument to Core IMC */ +enum { + OPAL_CORE_IMC_DISABLE, + OPAL_CORE_IMC_ENABLE, + OPAL_CORE_IMC_INIT, +}; + #endif /* __ASSEMBLY__ */ #endif /* __OPAL_API_H */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 
d93d08204243..c4baa6d32037 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -229,6 +229,8 @@ int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr); int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1, uint64_t value2, uint64_t value3); +int64_t opal_core_imc_counters_control(uint64_t operation, uint64_t addr, + uint64_t value2, uint64_t value3); /* Internal functions */ extern int early_init_dt_scan_opal(unsigned long node, const char *uname, diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index 728c67e139e0..45f9b35142a7 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -1,5 +1,5 @@ /* - * Nest Performance Monitor counter support. + * IMC Performance Monitor counter support. * * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation. * (C) 2016 Hemant K Shaw, IBM Corporation. @@ -19,6 +19,15 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; static cpumask_t nest_imc_cpumask; +/* + * Maintains base addresses for all the cores. + * MAX chip and core are defined as 32. So we + * statically allocate 8K for this structure. + * + * TODO -- Could be made dynamic + */ +static u64 per_core_pdbar_add[IMC_MAX_CHIPS][IMC_MAX_CORES]; +static cpumask_t core_imc_cpumask; s
[PATCH v6 05/11] powerpc/perf: Generic imc pmu event functions
From: Hemant Kumar Since, the IMC counters' data are periodically fed to a memory location, the functions to read/update, start/stop, add/del can be generic and can be used by all IMC PMU units. This patch adds a set of generic imc pmu related event functions to be used by each imc pmu unit. Add code to setup format attribute and to register imc pmus. Add a event_init function for nest_imc events. Signed-off-by: Anju T Sudhakar Signed-off-by: Hemant Kumar Signed-off-by: Madhavan Srinivasan --- arch/powerpc/include/asm/imc-pmu.h| 1 + arch/powerpc/perf/imc-pmu.c | 137 ++ arch/powerpc/platforms/powernv/opal-imc.c | 30 ++- 3 files changed, 164 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index a3d4f1bf9492..e13f51047522 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -65,4 +65,5 @@ struct imc_pmu { #define IMC_DOMAIN_NEST1 #define IMC_DOMAIN_UNKNOWN -1 +int imc_get_domain(struct device_node *pmu_dev); #endif /* PPC_POWERNV_IMC_PMU_DEF_H */ diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index ec464c76b749..bd5bf66bd920 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -18,6 +18,132 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; +/* Needed for sanity check */ +extern u64 nest_max_offset; + +PMU_FORMAT_ATTR(event, "config:0-20"); +static struct attribute *imc_format_attrs[] = { + &format_attr_event.attr, + NULL, +}; + +static struct attribute_group imc_format_group = { + .name = "format", + .attrs = imc_format_attrs, +}; + +static int nest_imc_event_init(struct perf_event *event) +{ + int chip_id; + u32 config = event->attr.config; + struct perchip_nest_info *pcni; + + if (event->attr.type != event->pmu->type) + return -ENOENT; + + /* Sampling not supported */ + if (event->hw.sample_period) + return -EINVAL; + + /* unsupported modes and filters */ 
+ if (event->attr.exclude_user || + event->attr.exclude_kernel || + event->attr.exclude_hv || + event->attr.exclude_idle || + event->attr.exclude_host || + event->attr.exclude_guest) + return -EINVAL; + + if (event->cpu < 0) + return -EINVAL; + + /* Sanity check for config (event offset) */ + if (config > nest_max_offset) + return -EINVAL; + + chip_id = topology_physical_package_id(event->cpu); + pcni = &nest_perchip_info[chip_id]; + + /* +* Memory for Nest HW counter data could be in multiple pages. +* Hence check and pick the right base page from the event offset, +* and then add to it. +*/ + event->hw.event_base = pcni->vbase[config/PAGE_SIZE] + + (config & ~PAGE_MASK); + + return 0; +} + +static void imc_read_counter(struct perf_event *event) +{ + u64 *addr, data; + + addr = (u64 *)event->hw.event_base; + data = __be64_to_cpu(READ_ONCE(*addr)); + local64_set(&event->hw.prev_count, data); +} + +static void imc_perf_event_update(struct perf_event *event) +{ + u64 counter_prev, counter_new, final_count, *addr; + + addr = (u64 *)event->hw.event_base; + counter_prev = local64_read(&event->hw.prev_count); + counter_new = __be64_to_cpu(READ_ONCE(*addr)); + final_count = counter_new - counter_prev; + + local64_set(&event->hw.prev_count, counter_new); + local64_add(final_count, &event->count); +} + +static void imc_event_start(struct perf_event *event, int flags) +{ + /* +* In Memory Counters are free flowing counters. HW or the microcode +* keeps adding to the counter offset in memory. To get event +* counter value, we snapshot the value here and we calculate +* delta at later point. +*/ + imc_read_counter(event); +} + +static void imc_event_stop(struct perf_event *event, int flags) +{ + /* +* Take a snapshot and calculate the delta and update +* the event counter values. 
+*/ + imc_perf_event_update(event); +} + +static int imc_event_add(struct perf_event *event, int flags) +{ + if (flags & PERF_EF_START) + imc_event_start(event, flags); + + return 0; +} + +/* update_pmu_ops : Populate the appropriate operations for "pmu" */ +static int update_pmu_ops(struct imc_pmu *pmu) +{ + if (!pmu) + return -EINVAL; + + pmu->pmu.task_ctx_nr = perf_invalid_context; + pmu->pmu.event_init = nest_imc_event_init; + pmu->pmu.add = imc_event_add; + pmu->pmu.del = imc_event_stop; + pmu->pmu.start = imc_event_start;
[PATCH v6 04/11] powerpc/perf: Add event attribute and group to IMC pmus
From: Hemant Kumar Device tree IMC driver code parses the IMC units and their events. It passes the information to IMC pmu code which is placed in powerpc/perf as "imc-pmu.c". This patch creates only event attributes and attribute groups for the IMC pmus. Signed-off-by: Anju T Sudhakar Signed-off-by: Hemant Kumar Signed-off-by: Madhavan Srinivasan --- arch/powerpc/perf/Makefile| 6 +- arch/powerpc/perf/imc-pmu.c | 97 +++ arch/powerpc/platforms/powernv/opal-imc.c | 12 +++- 3 files changed, 112 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/perf/imc-pmu.c diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile index 4d606b99a5cb..d0d1f04203c7 100644 --- a/arch/powerpc/perf/Makefile +++ b/arch/powerpc/perf/Makefile @@ -2,10 +2,14 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror obj-$(CONFIG_PERF_EVENTS) += callchain.o perf_regs.o +imc-$(CONFIG_PPC_POWERNV) += imc-pmu.o + obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o obj64-$(CONFIG_PPC_PERF_CTRS) += power4-pmu.o ppc970-pmu.o power5-pmu.o \ power5+-pmu.o power6-pmu.o power7-pmu.o \ - isa207-common.o power8-pmu.o power9-pmu.o + isa207-common.o power8-pmu.o power9-pmu.o \ + $(imc-y) + obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c new file mode 100644 index ..ec464c76b749 --- /dev/null +++ b/arch/powerpc/perf/imc-pmu.c @@ -0,0 +1,97 @@ +/* + * Nest Performance Monitor counter support. + * + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation. + * (C) 2016 Hemant K Shaw, IBM Corporation. + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation. 
+ */ +#include +#include +#include +#include +#include +#include +#include + +struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; + +/* dev_str_attr : Populate event "name" and string "str" in attribute */ +static struct attribute *dev_str_attr(const char *name, const char *str) +{ + struct perf_pmu_events_attr *attr; + + attr = kzalloc(sizeof(*attr), GFP_KERNEL); + + sysfs_attr_init(&attr->attr.attr); + + attr->event_str = str; + attr->attr.attr.name = name; + attr->attr.attr.mode = 0444; + attr->attr.show = perf_event_sysfs_show; + + return &attr->attr.attr; +} + +/* + * update_events_in_group: Update the "events" information in an attr_group + * and assign the attr_group to the pmu "pmu". + */ +static int update_events_in_group(struct imc_events *events, + int idx, struct imc_pmu *pmu) +{ + struct attribute_group *attr_group; + struct attribute **attrs; + int i; + + /* Allocate memory for attribute group */ + attr_group = kzalloc(sizeof(*attr_group), GFP_KERNEL); + if (!attr_group) + return -ENOMEM; + + /* Allocate memory for attributes */ + attrs = kzalloc((sizeof(struct attribute *) * (idx + 1)), GFP_KERNEL); + if (!attrs) { + kfree(attr_group); + return -ENOMEM; + } + + attr_group->name = "events"; + attr_group->attrs = attrs; + for (i = 0; i < idx; i++, events++) { + attrs[i] = dev_str_attr((char *)events->ev_name, + (char *)events->ev_value); + } + + pmu->attr_groups[0] = attr_group; + return 0; +} + +/* + * init_imc_pmu : Setup the IMC pmu device in "pmu_ptr" and its events + *"events". + * Setup the cpu mask information for these pmus and setup the state machine + * hotplug notifiers as well. 
+ */ +int init_imc_pmu(struct imc_events *events, int idx, +struct imc_pmu *pmu_ptr) +{ + int ret = -ENODEV; + + ret = update_events_in_group(events, idx, pmu_ptr); + if (ret) + goto err_free; + + return 0; + +err_free: + /* Only free the attr_groups which are dynamically allocated */ + if (pmu_ptr->attr_groups[0]) { + kfree(pmu_ptr->attr_groups[0]->attrs); + kfree(pmu_ptr->attr_groups[0]); + } + + return ret; +} diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index ba0203e669c0..73c46703c2af 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -32,8 +32,11 @@ #include #include -struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; -struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; +extern struct perchip_n
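One detail worth noting in update_events_in_group() above: it allocates `idx + 1` attribute pointers with kzalloc but only fills in `idx` of them, so the zero-filled final slot acts as the NULL terminator that sysfs uses to find the end of the array. A minimal userspace sketch of that allocation pattern (with `calloc` standing in for kzalloc and plain strings standing in for `struct attribute`):

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace sketch of the attrs allocation in update_events_in_group():
 * idx entries plus one extra zeroed slot, so the array is NULL-terminated
 * (sysfs walks attrs[] until it hits NULL). */
static const char **build_attr_array(const char *const names[], int idx)
{
	/* calloc stands in for kzalloc: slot idx stays NULL */
	const char **attrs = calloc(idx + 1, sizeof(*attrs));
	int i;

	if (!attrs)
		return NULL;
	for (i = 0; i < idx; i++)
		attrs[i] = names[i];
	return attrs;
}
```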
[PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and their events
From: Hemant Kumar

Parse the device tree to detect IMC units. Traverse through each IMC unit node to find the supported events and the corresponding unit/scale files (if any). Here is the DTS file for reference:
https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts

The device tree for IMC counters starts at the node "imc-counters". This node contains all the IMC PMU nodes and event nodes for these IMC PMUs. The PMU nodes have an "events" property which holds a phandle to the actual events node. The events are separated from the PMU nodes to abstract out the common events. For example, PMU nodes "mcs0", "mcs1" etc. will contain a pointer to "nest-mcs-events", since the events are common between these PMUs.

These events have a different prefix based on their relation to different PMUs, and hence the PMU nodes themselves contain an "events-prefix" property. The value of this property, concatenated with the event name, forms the actual event name. Also, each PMU has a "reg" field giving the base offset for the events which belong to it. This "reg" field is added to the "reg" of an event in the "events" node, which gives us the location of the counter data. Kernel code uses this offset as the event configuration value.

The device tree parser code also looks for scale/unit properties in the event node and passes the values on as event attrs for the perf interface, to be used in post-processing by the perf tool. Some PMUs may have common scale and unit properties, which implies that all events supported by such a PMU inherit the scale and unit properties of the PMU itself. For those events, we need to set the common unit and scale values.

On failure to initialize any unit or any event, disable that unit and continue setting up the rest of them.
Signed-off-by: Hemant Kumar Signed-off-by: Anju T Sudhakar Signed-off-by: Madhavan Srinivasan --- arch/powerpc/platforms/powernv/opal-imc.c | 386 ++ 1 file changed, 386 insertions(+) diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index c476d596c6a8..ba0203e669c0 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -33,6 +33,388 @@ #include struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; + +static int imc_event_info(char *name, struct imc_events *events) +{ + char *buf; + + /* memory for content */ + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + events->ev_name = name; + events->ev_value = buf; + return 0; +} + +static int imc_event_info_str(struct property *pp, char *name, + struct imc_events *events) +{ + int ret; + + ret = imc_event_info(name, events); + if (ret) + return ret; + + if (!pp->value || (strnlen(pp->value, pp->length) == pp->length) || + (pp->length > IMC_MAX_PMU_NAME_LEN)) + return -EINVAL; + strncpy(events->ev_value, (const char *)pp->value, pp->length); + + return 0; +} + +static int imc_event_info_val(char *name, u32 val, + struct imc_events *events) +{ + int ret; + + ret = imc_event_info(name, events); + if (ret) + return ret; + sprintf(events->ev_value, "event=0x%x", val); + + return 0; +} + +static int set_event_property(struct property *pp, char *event_prop, + struct imc_events *events, char *ev_name) +{ + char *buf; + int ret; + + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + sprintf(buf, "%s.%s", ev_name, event_prop); + ret = imc_event_info_str(pp, buf, events); + if (ret) { + kfree(events->ev_name); + kfree(events->ev_value); + } + + return ret; +} + +/* + * imc_events_node_parser: Parse the event node "dev" and assign the parsed + * information to event "events". 
+ * + * Parses the "reg" property of this event. "reg" gives us the event offset. + * Also, parse the "scale" and "unit" properties, if any. + */ +static int imc_events_node_parser(struct device_node *dev, + struct imc_events *events, + struct property *event_scale, + struct property *event_unit, + struct property *name_prefix, + u32 reg) +{ + struct property *name, *pp; + char *ev_name; + u32 val; + int idx = 0, ret; + + if (!dev) + return -EINVAL; + + /* +* Loop through each property of an event node +*/ + name = of_find_property(dev, "event-name", NULL); + if (!name) + return -ENODEV; + +
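The naming scheme described in the commit message — "events-prefix" prepended to the event name, plus a per-event ".scale"/".unit" suffix for the property files (mirroring the `"%s.%s"` format in set_event_property() above) — can be sketched in a few lines of userspace C. The helper names here are illustrative, not the driver's own:

```c
#include <stdio.h>

#define IMC_MAX_PMU_NAME_LEN 256

/* Sketch: the PMU node's "events-prefix" is prepended to the event
 * node's "event-name" to form the exported event name. */
static void imc_event_name(char *out, const char *prefix, const char *name)
{
	snprintf(out, IMC_MAX_PMU_NAME_LEN, "%s%s", prefix, name);
}

/* Sketch: per-event scale/unit sysfs entries get a ".<prop>" suffix,
 * as in set_event_property()'s "%s.%s" format. */
static void imc_event_prop_name(char *out, const char *ev_name,
				const char *prop)
{
	snprintf(out, IMC_MAX_PMU_NAME_LEN, "%s.%s", ev_name, prop);
}
```

For instance, the "PM_MCS0_" prefix plus the "RRTO_QFULL_NO_DISP" event from the example DTS yields the event name "PM_MCS0_RRTO_QFULL_NO_DISP".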
[PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module
From: Hemant Kumar This patch does three things : - Enables "opal.c" to create a platform device for the IMC interface according to the appropriate compatibility string. - Find the reserved-memory region details from the system device tree and get the base address of HOMER (Reserved memory) region address for each chip. - We also get the Nest PMU counter data offsets (in the HOMER region) and their sizes. The offsets for the counters' data are fixed and won't change from chip to chip. The device tree parsing logic is separated from the PMU creation functions (which is done in subsequent patches). Signed-off-by: Anju T Sudhakar Signed-off-by: Hemant Kumar Signed-off-by: Madhavan Srinivasan --- arch/powerpc/platforms/powernv/Makefile | 2 +- arch/powerpc/platforms/powernv/opal-imc.c | 126 ++ arch/powerpc/platforms/powernv/opal.c | 14 3 files changed, 141 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/platforms/powernv/opal-imc.c diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile index b5d98cb3f482..44909fec1121 100644 --- a/arch/powerpc/platforms/powernv/Makefile +++ b/arch/powerpc/platforms/powernv/Makefile @@ -2,7 +2,7 @@ obj-y += setup.o opal-wrappers.o opal.o opal-async.o idle.o obj-y += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o obj-y += rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o obj-y += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o -obj-y += opal-kmsg.o +obj-y += opal-kmsg.o opal-imc.o obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c new file mode 100644 index ..c476d596c6a8 --- /dev/null +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -0,0 +1,126 @@ +/* + * OPAL IMC interface detection driver + * Supported on POWERNV platform + * + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation. 
+ * (C) 2016 Hemant K Shaw, IBM Corporation. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; + +static int opal_imc_counters_probe(struct platform_device *pdev) +{ + struct device_node *child, *imc_dev, *rm_node = NULL; + struct perchip_nest_info *pcni; + u32 pages, nest_offset, nest_size, idx; + int i = 0; + const char *node_name; + const __be32 *addrp; + u64 reg_addr, reg_size; + + if (!pdev || !pdev->dev.of_node) + return -ENODEV; + + /* +* Check whether this kdump kernel. If yes, just return. +*/ + if (is_kdump_kernel()) + return -ENODEV; + + imc_dev = pdev->dev.of_node; + + /* +* nest_offset : where the nest-counters' data start. 
+* size : size of the entire nest-counters region +*/ + if (of_property_read_u32(imc_dev, "imc-nest-offset", &nest_offset)) + goto err; + + if (of_property_read_u32(imc_dev, "imc-nest-size", &nest_size)) + goto err; + + /* Find the "homer region" for each chip */ + rm_node = of_find_node_by_path("/reserved-memory"); + if (!rm_node) + goto err; + + for_each_child_of_node(rm_node, child) { + if (of_property_read_string_index(child, "name", 0, + &node_name)) + continue; + if (strncmp("ibm,homer-image", node_name, + strlen("ibm,homer-image"))) + continue; + + /* Get the chip id to which the above homer region belongs to */ + if (of_property_read_u32(child, "ibm,chip-id", &idx)) + goto err; + + pcni = &nest_perchip_info[idx]; + addrp = of_get_address(child, 0, ®_size, NULL); + if (!addrp) + goto err; + + /* Fetch the homer region base address */ + reg_addr = of_read_number(addrp, 2); + pcni->pbase = reg_addr; + /* Add the nest IMC Base offset */ + pcni->pbase =
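The addressing scheme being set up here — `pbase` at homer base plus the nest offset, tracked page by page in `vbase[]` — is what lets a later patch's nest_imc_event_init() resolve an event's config offset as `vbase[config / PAGE_SIZE] + (config & ~PAGE_MASK)`. A hedged userspace sketch of that lookup (fixed 4 KiB pages, pretend mappings):

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_PAGE_MASK	(~(SKETCH_PAGE_SIZE - 1))
#define IMC_NEST_MAX_PAGES	16

/* Sketch of the event_base computation: pick the right per-page base
 * from the event offset, then add the offset within that page. */
static uint64_t sketch_event_base(const uint64_t vbase[], uint64_t config)
{
	return vbase[config / SKETCH_PAGE_SIZE] +
	       (config & ~SKETCH_PAGE_MASK);
}
```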
[PATCH v6 01/11] powerpc/powernv: Data structure and macros definitions
From: Hemant Kumar Create new header file "imc-pmu.h" to add the data structures and macros needed for IMC pmu support. Signed-off-by: Anju T Sudhakar Signed-off-by: Hemant Kumar Signed-off-by: Madhavan Srinivasan --- arch/powerpc/include/asm/imc-pmu.h | 68 ++ 1 file changed, 68 insertions(+) create mode 100644 arch/powerpc/include/asm/imc-pmu.h diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h new file mode 100644 index ..a3d4f1bf9492 --- /dev/null +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -0,0 +1,68 @@ +#ifndef PPC_POWERNV_IMC_PMU_DEF_H +#define PPC_POWERNV_IMC_PMU_DEF_H + +/* + * IMC Nest Performance Monitor counter support. + * + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation. + * (C) 2016 Hemant K Shaw, IBM Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation. + */ + +#include +#include +#include +#include +#include + +#define IMC_MAX_CHIPS 32 +#define IMC_MAX_PMUS 32 +#define IMC_MAX_PMU_NAME_LEN 256 + +#define IMC_NEST_MAX_PAGES 16 + +#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters" +#define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest" + +/* + * Structure to hold per chip specific memory address + * information for nest pmus. Nest Counter data are exported + * in per-chip reserved memory region by the PORE Engine. + */ +struct perchip_nest_info { + u32 chip_id; + u64 pbase; + u64 vbase[IMC_NEST_MAX_PAGES]; + u64 size; +}; + +/* + * Place holder for nest pmu events and values. + */ +struct imc_events { + char *ev_name; + char *ev_value; +}; + +/* + * Device tree parser code detects IMC pmu support and + * registers new IMC pmus. This structure will + * hold the pmu functions and attrs for each imc pmu and + * will be referenced at the time of pmu registration. 
+ */ +struct imc_pmu { + struct pmu pmu; + int domain; + const struct attribute_group *attr_groups[4]; +}; + +/* + * Domains for IMC PMUs + */ +#define IMC_DOMAIN_NEST1 +#define IMC_DOMAIN_UNKNOWN -1 + +#endif /* PPC_POWERNV_IMC_PMU_DEF_H */ -- 2.7.4
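Since struct imc_pmu embeds struct pmu as a member, pmu callbacks that receive a `struct pmu *` can recover the wrapping imc_pmu with the usual container_of() idiom. A userspace sketch (with a hand-rolled container_of and trimmed-down struct definitions — the helper name is illustrative, not from the patch):

```c
#include <stddef.h>

/* Trimmed-down stand-ins for the kernel structures. */
struct pmu { int type; };

struct imc_pmu_sketch {
	struct pmu pmu;
	int domain;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the wrapping imc_pmu from the embedded struct pmu. */
static int imc_domain_of(struct pmu *p)
{
	return container_of(p, struct imc_pmu_sketch, pmu)->domain;
}
```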
[PATCH v6 00/11] IMC Instrumentation Support
Power9 has an In-Memory-Collection (IMC) infrastructure which contains various Performance Monitoring Units (PMUs) at the Nest level (these are on-chip but off-core), Core level and Thread level. The Nest PMU counters are handled by the Nest IMC microcode which runs in the OCC (On-Chip Controller) complex. The microcode collects the counter data and moves the nest IMC counter data to memory. The Core and Thread IMC PMU counters are handled in the core. Core level PMU counters give us the IMC counters' data per core, and thread level PMU counters give us the IMC counters' data per CPU thread.

This patchset enables the nest IMC, core IMC and thread IMC PMUs and is based on the initial work done by Madhavan Srinivasan, "Nest Instrumentation Support":
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132078.html

v1 of this patchset can be found here:
https://lwn.net/Articles/705475/

Nest events: Per-chip nest instrumentation provides various per-chip metrics such as memory, powerbus, Xlink and Alink bandwidth.

Core events: Per-core IMC instrumentation provides various per-core metrics such as non-idle cycles, non-idle instructions, various cache and memory related metrics, etc.

Thread events: All the events at the thread level are the same as at the core level, with the difference being the domain. These are per-CPU metrics.

PMU Events' Information:
OPAL obtains the IMC PMU and event information from the IMC Catalog and passes it on to the kernel via the device tree. An event's information contains:
- Event name
- Event offset
- Event description
and, optionally:
- Event scale
- Event unit

Some PMUs may have common scale and unit values for all their supported events. For those cases, the scale and unit properties for those events must be inherited from the PMU. The event offset in memory is where the counter data gets accumulated.
The OPAL-side patches are posted upstream : https://lists.ozlabs.org/pipermail/skiboot/2017-March/006531.html The kernel discovers the IMC counters information in the device tree at the "imc-counters" device node which has a compatible field "ibm,opal-in-memory-counters". Parsing of the Events' information: To parse the IMC PMUs and events information, the kernel has to discover the "imc-counters" node and walk through the pmu and event nodes. Here is an excerpt of the dt showing the imc-counters with mcs0 (nest), core and thread node: https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts /dts-v1/; [...] /dts-v1/; / { name = ""; compatible = "ibm,opal-in-memory-counters"; #address-cells = <0x1>; #size-cells = <0x1>; imc-nest-offset = <0x32>; imc-nest-size = <0x3>; version-id = ""; NEST_MCS: nest-mcs-events { #address-cells = <0x1>; #size-cells = <0x1>; event@0 { event-name = "RRTO_QFULL_NO_DISP" ; reg = <0x0 0x8>; desc = "RRTO not dispatched in MCS0 due to capacity - pulses once for each time a valid RRTO op is not dispatched due to a command list full condition" ; }; event@8 { event-name = "WRTO_QFULL_NO_DISP" ; reg = <0x8 0x8>; desc = "WRTO not dispatched in MCS0 due to capacity - pulses once for each time a valid WRTO op is not dispatched due to a command list full condition" ; }; [...] mcs0 { compatible = "ibm,imc-counters-nest"; events-prefix = "PM_MCS0_"; unit = ""; scale = ""; reg = <0x118 0x8>; events = < &NEST_MCS >; }; mcs1 { compatible = "ibm,imc-counters-nest"; events-prefix = "PM_MCS1_"; unit = ""; scale = ""; reg = <0x198 0x8>; events = < &NEST_MCS >; }; [...] 
CORE_EVENTS: core-events { #address-cells = <0x1>; #size-cells = <0x1>; event@e0 { event-name = "0THRD_NON_IDLE_PCYC" ; reg = <0xe0 0x8>; desc = "The number of processor cycles when all threads are idle" ; }; event@120 { event-name = "1THRD_NON_IDLE_PCYC" ; reg = <0x120 0x8>; desc = "The number of processor cycles when exactly one SMT thread is executing non-idle code" ; }; [...] core { compatible = "ibm,imc-counters-core"; events-prefix = "CPM_"; unit = ""; scale = ""; reg = <0x0 0x8>; events = < &CORE_EVENTS >; }; thread { c
Re: [RFC PATCH 2/5] soc/fsl/qbman: Use shared-dma-pool for QMan private memory allocations
On 01/04/17 08:25, Scott Wood wrote: > On Fri, 2017-03-31 at 18:55 +0100, Robin Murphy wrote: >> On 31/03/17 04:27, Michael Ellerman wrote: >>> >>> Robin Murphy writes: >>> Hi Roy, On 29/03/17 22:13, Roy Pledge wrote: > > Use the shared-memory-pool mechanism for frame queue descriptor and > packed frame descriptor record area allocations. Thanks for persevering with this - in my opinion it's now looking like it was worth the effort :) AFAICS the ioremap_wc() that this leads to does appear to give back something non-cacheable on PPC (assuming "pgprot_noncached_wc" isn't horrendously misnamed), and "no-map" should rule out any cacheable linear map alias existing, so it would seem that this approach should avert Scott's concerns about attribute mismatches. >>> How does 'no-map' translate into something being excluded from the >>> linear mapping? >> Reserved regions marked with "no-map" get memblock_remove()d by >> early_init_dt_alloc_reserved_memory_arch(). As I understand things, the >> linear map should only cover memblock areas, and it would be explicitly >> violating the semantics of "no-map" to still cover such a region. > > Discontiguous memory isn't supported on these PPC chips. Everything up to > memblock_end_of_DRAM() gets mapped -- and if that were to change, the > fragmentation would waste TLB1 entries. Ah, so the "PPC-specific angles I'm not aware of" category is indeed non-empty - I guess the lack of HAVE_GENERIC_DMA_COHERENT might be related, then. That said, though, AFAICS only certain x86 and s390 configurations ever call memblock_set_bottom_up(true), so we should be able to assume that the reserved region allocations always fall through to __memblock_find_range_top_down(). Thus if your DRAM is contiguous, then "no-map"-ing the reserved regions will simply end up pushing memblock_end_of_DRAM() down in a manner that would appear to still avoid overlaps. 
I can only see that going wrong if the end of DRAM wasn't at least 32MB aligned to begin with - is that ever likely to happen in practice? Robin. > This also breaks compatibility with existing device trees. I suggest putting > an ifdef in the qbman driver to add the new scheme for non-PPC arches only. > > -Scott >
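The argument above — that with top-down memblock allocation, a "no-map" reservation simply lowers memblock_end_of_DRAM() without fragmenting the remaining memory, provided the end of DRAM is aligned to the reservation — can be modelled with a toy allocator. This is a simplified sketch of the reasoning, not the real memblock code:

```c
#include <stdint.h>

/* Toy model of a top-down allocator carving an aligned reservation
 * off the top of DRAM.  If the current end of memory is already
 * aligned to the request, the reservation just lowers the end and no
 * hole appears; otherwise a fragment would be left above it.
 * Returns the reserved base, or 0 on the fragmenting case. */
static uint64_t reserve_top_down(uint64_t *end, uint64_t size,
				 uint64_t align)
{
	uint64_t base = (*end - size) & ~(align - 1);

	if (base + size != *end)
		return 0;	/* would leave a fragment above the reservation */
	*end = base;		/* end of DRAM simply moves down */
	return base;
}
```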
Re: [PATCH] powerpc/misc: fix exported functions that reference the TOC
Oliver O'Halloran writes:
> When the kernel is compiled to use the 64-bit ABIv2, the _GLOBAL() macro does
> not include a global entry point. A function's global entry point is used
> when the function is called from a different TOC context, and in the kernel
> this typically means a call from a module into the vmlinux (or vice versa).
>
> There are a few exported ASM functions declared with _GLOBAL(), and calling
> them from a module will likely crash the kernel since any TOC-relative load
> will yield garbage.
>
> To fix this, use _GLOBAL_TOC() for exported asm functions rather than
> _GLOBAL(), and add some documentation about when to use each.

I wonder if we should just change _GLOBAL() to include the global entry point. Persisting with _GLOBAL_TOC() seems like it's just going to be a game of whack-a-mole.

Just grepping now I see ~50 functions defined with _GLOBAL() which are also EXPORT'ed. Now presumably none of them use the TOC? But there's no way to verify it, other than inspecting each one, and there's no way to ensure we don't break it again in future.

The other option would be just to make a rule that anything EXPORT'ed must use _GLOBAL_TOC().

cheers
Re: [PATCH V3 7/7] cxl: Add psl9 specific code
Le 28/03/2017 à 17:14, Christophe Lombard a écrit : The new Coherent Accelerator Interface Architecture, level 2, for the IBM POWER9 brings new content and features: - POWER9 Service Layer - Registers - Radix mode - Process element entry - Dedicated-Shared Process Programming Model - Translation Fault Handling - CAPP - Memory Context ID If a valid mm_struct is found the memory context id is used for each transaction associated with the process handle. The PSL uses the context ID to find the corresponding process element. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/context.c | 16 ++- drivers/misc/cxl/cxl.h | 137 + drivers/misc/cxl/debugfs.c | 19 drivers/misc/cxl/fault.c | 78 -- drivers/misc/cxl/guest.c | 8 +- drivers/misc/cxl/irq.c | 53 ++ drivers/misc/cxl/native.c | 247 - drivers/misc/cxl/pci.c | 246 +--- drivers/misc/cxl/trace.h | 43 9 files changed, 752 insertions(+), 95 deletions(-) diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c index ac2531e..45363be 100644 --- a/drivers/misc/cxl/context.c +++ b/drivers/misc/cxl/context.c @@ -188,12 +188,24 @@ int cxl_context_iomap(struct cxl_context *ctx, struct vm_area_struct *vma) if (ctx->afu->current_mode == CXL_MODE_DEDICATED) { if (start + len > ctx->afu->adapter->ps_size) return -EINVAL; + + if (cxl_is_psl9(ctx->afu)) { + /* make sure there is a valid problem state +* area space for this AFU +*/ + if (ctx->master && !ctx->afu->psa) { + pr_devel("AFU doesn't support mmio space\n"); + return -EINVAL; + } + + /* Can't mmap until the AFU is enabled */ + if (!ctx->afu->enabled) + return -EBUSY; + } } else { if (start + len > ctx->psn_size) return -EINVAL; - } - if (ctx->afu->current_mode != CXL_MODE_DEDICATED) { /* make sure there is a valid per process space for this AFU */ if ((ctx->master && !ctx->afu->psa) || (!ctx->afu->pp_psa)) { pr_devel("AFU doesn't support mmio space\n"); diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 2f2d9e4..aac0504 100644 --- 
a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -63,7 +63,7 @@ typedef struct { /* Memory maps. Ref CXL Appendix A */ /* PSL Privilege 1 Memory Map */ -/* Configuration and Control area */ +/* Configuration and Control area - CAIA 1&2 */ static const cxl_p1_reg_t CXL_PSL_CtxTime = {0x}; static const cxl_p1_reg_t CXL_PSL_ErrIVTE = {0x0008}; static const cxl_p1_reg_t CXL_PSL_KEY1= {0x0010}; @@ -98,11 +98,29 @@ static const cxl_p1_reg_t CXL_XSL_Timebase = {0x0100}; static const cxl_p1_reg_t CXL_XSL_TB_CTLSTAT = {0x0108}; static const cxl_p1_reg_t CXL_XSL_FEC = {0x0158}; static const cxl_p1_reg_t CXL_XSL_DSNCTL= {0x0168}; +/* PSL registers - CAIA 2 */ +static const cxl_p1_reg_t CXL_PSL9_CONTROL = {0x0020}; +static const cxl_p1_reg_t CXL_XSL9_DSNCTL = {0x0168}; +static const cxl_p1_reg_t CXL_PSL9_FIR1 = {0x0300}; +static const cxl_p1_reg_t CXL_PSL9_FIR2 = {0x0308}; +static const cxl_p1_reg_t CXL_PSL9_Timebase = {0x0310}; +static const cxl_p1_reg_t CXL_PSL9_DEBUG= {0x0320}; +static const cxl_p1_reg_t CXL_PSL9_FIR_CNTL = {0x0348}; +static const cxl_p1_reg_t CXL_PSL9_DSNDCTL = {0x0350}; +static const cxl_p1_reg_t CXL_PSL9_TB_CTLSTAT = {0x0340}; +static const cxl_p1_reg_t CXL_PSL9_TRACECFG = {0x0368}; +static const cxl_p1_reg_t CXL_PSL9_APCDEDALLOC = {0x0378}; +static const cxl_p1_reg_t CXL_PSL9_APCDEDTYPE = {0x0380}; +static const cxl_p1_reg_t CXL_PSL9_TNR_ADDR = {0x0388}; +static const cxl_p1_reg_t CXL_PSL9_GP_CT = {0x0398}; +static const cxl_p1_reg_t CXL_XSL9_IERAT = {0x0588}; +static const cxl_p1_reg_t CXL_XSL9_ILPP = {0x0590}; + /* 0x7F00:7FFF Reserved PCIe MSI-X Pending Bit Array area */ /* 0x8000: Reserved PCIe MSI-X Table Area */ /* PSL Slice Privilege 1 Memory Map */ -/* Configuration Area */ +/* Configuration Area - CAIA 1&2 */ static const cxl_p1n_reg_t CXL_PSL_SR_An = {0x00}; static const cxl_p1n_reg_t CXL_PSL_LPID_An= {0x08}; static const cxl_p1n_reg_t CXL_PSL_AMBAR_An = {0x10}; @@ -111,17 +129,18 @@ static const cxl_p1n_reg_t CXL_PSL_ID_An = {0x20}; 
static const cxl_p1n_reg_t CXL_PSL_SERR_An= {0x28}; /* Memory Management and Lookaside Buffer Management - CAIA 1*/ static const cxl_p1n_reg_t CXL_PSL_SDR_An = {0x30}; +/* Memory Management and Lookaside Buffer Management - CAIA 1&2 */ static const cxl_p1n_reg_t CXL_PSL_A
Re: [PATCH V3 6/7] cxl: Isolate few psl8 specific calls
Le 28/03/2017 à 17:14, Christophe Lombard a écrit : Point out the specific Coherent Accelerator Interface Architecture, level 1, registers. Code and functions specific to PSL8 (CAIA1) must be framed. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/context.c | 28 +++- drivers/misc/cxl/cxl.h | 35 +++-- drivers/misc/cxl/debugfs.c | 6 +++-- drivers/misc/cxl/fault.c | 14 +- drivers/misc/cxl/native.c | 58 ++--- drivers/misc/cxl/pci.c | 64 +++--- 6 files changed, 136 insertions(+), 69 deletions(-) diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c index 2e935ea..ac2531e 100644 --- a/drivers/misc/cxl/context.c +++ b/drivers/misc/cxl/context.c @@ -39,23 +39,26 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu *afu, bool master) { int i; - spin_lock_init(&ctx->sste_lock); ctx->afu = afu; ctx->master = master; ctx->pid = NULL; /* Set in start work ioctl */ mutex_init(&ctx->mapping_lock); ctx->mapping = NULL; - /* -* Allocate the segment table before we put it in the IDR so that we -* can always access it when dereferenced from IDR. For the same -* reason, the segment table is only destroyed after the context is -* removed from the IDR. Access to this in the IOCTL is protected by -* Linux filesytem symantics (can't IOCTL until open is complete). -*/ - i = cxl_alloc_sst(ctx); - if (i) - return i; + if (cxl_is_psl8(afu)) { + spin_lock_init(&ctx->sste_lock); + + /* +* Allocate the segment table before we put it in the IDR so that we +* can always access it when dereferenced from IDR. For the same +* reason, the segment table is only destroyed after the context is +* removed from the IDR. Access to this in the IOCTL is protected by +* Linux filesytem symantics (can't IOCTL until open is complete). 
+*/ + i = cxl_alloc_sst(ctx); + if (i) + return i; + } INIT_WORK(&ctx->fault_work, cxl_handle_fault); @@ -308,7 +311,8 @@ static void reclaim_ctx(struct rcu_head *rcu) { struct cxl_context *ctx = container_of(rcu, struct cxl_context, rcu); - free_page((u64)ctx->sstp); + if (cxl_is_psl8(ctx->afu)) + free_page((u64)ctx->sstp); if (ctx->ff_page) __free_page(ctx->ff_page); ctx->sstp = NULL; diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index c7dd315..2f2d9e4 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h This is very minor, but a lot of the changes in the comments below could probably be merged in the next patch when the CAIA2 version of those registers is introduced. No big deal though. @@ -73,7 +73,7 @@ static const cxl_p1_reg_t CXL_PSL_Control = {0x0020}; static const cxl_p1_reg_t CXL_PSL_DLCNTL = {0x0060}; static const cxl_p1_reg_t CXL_PSL_DLADDR = {0x0068}; -/* PSL Lookaside Buffer Management Area */ +/* PSL Lookaside Buffer Management Area - CAIA 1 */ static const cxl_p1_reg_t CXL_PSL_LBISEL = {0x0080}; static const cxl_p1_reg_t CXL_PSL_SLBIE = {0x0088}; static const cxl_p1_reg_t CXL_PSL_SLBIA = {0x0090}; @@ -82,7 +82,7 @@ static const cxl_p1_reg_t CXL_PSL_TLBIA = {0x00A8}; static const cxl_p1_reg_t CXL_PSL_AFUSEL = {0x00B0}; /* 0x00C0:7EFF Implementation dependent area */ -/* PSL registers */ +/* PSL registers - CAIA 1 */ static const cxl_p1_reg_t CXL_PSL_FIR1 = {0x0100}; static const cxl_p1_reg_t CXL_PSL_FIR2 = {0x0108}; static const cxl_p1_reg_t CXL_PSL_Timebase = {0x0110}; @@ -109,7 +109,7 @@ static const cxl_p1n_reg_t CXL_PSL_AMBAR_An = {0x10}; static const cxl_p1n_reg_t CXL_PSL_SPOffset_An= {0x18}; static const cxl_p1n_reg_t CXL_PSL_ID_An = {0x20}; static const cxl_p1n_reg_t CXL_PSL_SERR_An= {0x28}; -/* Memory Management and Lookaside Buffer Management */ +/* Memory Management and Lookaside Buffer Management - CAIA 1*/ static const cxl_p1n_reg_t CXL_PSL_SDR_An = {0x30}; static const cxl_p1n_reg_t CXL_PSL_AMOR_An= {0x38}; /* 
Pointer Area */ @@ -124,6 +124,7 @@ static const cxl_p1n_reg_t CXL_PSL_IVTE_Limit_An = {0xB8}; /* 0xC0:FF Implementation Dependent Area */ static const cxl_p1n_reg_t CXL_PSL_FIR_SLICE_An = {0xC0}; static const cxl_p1n_reg_t CXL_AFU_DEBUG_An = {0xC8}; +/* 0xC0:FF Implementation Dependent Area - CAIA 1 */ static const cxl_p1n_reg_t CXL_PSL_APCALLOC_A = {0xD0}; static const cxl_p1n_reg_t CXL_PSL_COALLOC_A = {0xD8}; static const cxl_p1n_reg_t CXL_PSL_RXCTL_A= {0xE0}; @@ -133,12 +134,14 @@ static const cxl_p1n_reg_t CXL_PSL_SLICE_TRACE= {0xE8}; /* Configuration and
Re: [PATCH V3 5/7] cxl: Rename some psl8 specific functions
On 28/03/2017 at 17:14, Christophe Lombard wrote: Rename a few functions, changing the '_psl' suffix to '_psl8', to make clear that the implementation is psl8 specific. Those functions will have an equivalent implementation for the psl9 in a later patch. Signed-off-by: Christophe Lombard --- Acked-by: Frederic Barrat drivers/misc/cxl/cxl.h | 26 +- drivers/misc/cxl/debugfs.c | 6 +++--- drivers/misc/cxl/guest.c | 2 +- drivers/misc/cxl/irq.c | 2 +- drivers/misc/cxl/native.c | 12 ++-- drivers/misc/cxl/pci.c | 46 +++--- 6 files changed, 47 insertions(+), 47 deletions(-) diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 65d9fb6..c7dd315 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -813,10 +813,10 @@ int afu_register_irqs(struct cxl_context *ctx, u32 count); void afu_release_irqs(struct cxl_context *ctx, void *cookie); void afu_irq_name_free(struct cxl_context *ctx); -int cxl_attach_afu_directed_psl(struct cxl_context *ctx, u64 wed, u64 amr); -int cxl_activate_dedicated_process_psl(struct cxl_afu *afu); -int cxl_attach_dedicated_process_psl(struct cxl_context *ctx, u64 wed, u64 amr); -void cxl_update_dedicated_ivtes_psl(struct cxl_context *ctx); +int cxl_attach_afu_directed_psl8(struct cxl_context *ctx, u64 wed, u64 amr); +int cxl_activate_dedicated_process_psl8(struct cxl_afu *afu); +int cxl_attach_dedicated_process_psl8(struct cxl_context *ctx, u64 wed, u64 amr); +void cxl_update_dedicated_ivtes_psl8(struct cxl_context *ctx); #ifdef CONFIG_DEBUG_FS @@ -826,10 +826,10 @@ int cxl_debugfs_adapter_add(struct cxl *adapter); void cxl_debugfs_adapter_remove(struct cxl *adapter); int cxl_debugfs_afu_add(struct cxl_afu *afu); void cxl_debugfs_afu_remove(struct cxl_afu *afu); -void cxl_stop_trace_psl(struct cxl *cxl); -void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir); +void cxl_stop_trace_psl8(struct cxl *cxl); +void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry *dir); void
cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir); -void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir); +void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir); #else /* CONFIG_DEBUG_FS */ @@ -860,11 +860,11 @@ static inline void cxl_debugfs_afu_remove(struct cxl_afu *afu) { } -static inline void cxl_stop_trace(struct cxl *cxl) +static inline void cxl_stop_trace_psl8(struct cxl *cxl) { } -static inline void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, +static inline void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry *dir) { } @@ -874,7 +874,7 @@ static inline void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, { } -static inline void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir) +static inline void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir) { } @@ -932,8 +932,8 @@ struct cxl_irq_info { }; void cxl_assign_psn_space(struct cxl_context *ctx); -int cxl_invalidate_all_psl(struct cxl *adapter); -irqreturn_t cxl_irq_psl(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info); +int cxl_invalidate_all_psl8(struct cxl *adapter); +irqreturn_t cxl_irq_psl8(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info); irqreturn_t cxl_fail_irq_psl(struct cxl_afu *afu, struct cxl_irq_info *irq_info); int cxl_register_one_irq(struct cxl *adapter, irq_handler_t handler, void *cookie, irq_hw_number_t *dest_hwirq, @@ -945,7 +945,7 @@ int cxl_data_cache_flush(struct cxl *adapter); int cxl_afu_disable(struct cxl_afu *afu); int cxl_psl_purge(struct cxl_afu *afu); -void cxl_native_irq_dump_regs_psl(struct cxl_context *ctx); +void cxl_native_irq_dump_regs_psl8(struct cxl_context *ctx); void cxl_native_err_irq_dump_regs(struct cxl *adapter); int cxl_pci_vphb_add(struct cxl_afu *afu); void cxl_pci_vphb_remove(struct cxl_afu *afu); diff --git a/drivers/misc/cxl/debugfs.c b/drivers/misc/cxl/debugfs.c index 4848ebf..2ff10a9 
100644 --- a/drivers/misc/cxl/debugfs.c +++ b/drivers/misc/cxl/debugfs.c @@ -15,7 +15,7 @@ static struct dentry *cxl_debugfs; -void cxl_stop_trace_psl(struct cxl *adapter) +void cxl_stop_trace_psl8(struct cxl *adapter) { int slice; @@ -53,7 +53,7 @@ static struct dentry *debugfs_create_io_x64(const char *name, umode_t mode, (void __force *)value, &fops_io_x64); } -void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir) +void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry *dir) { debugfs_create_io_x64("fir1", S_IRUSR, dir, _cxl_p1_addr(adapter, CXL_PSL_FIR1)); debugfs_create_io_x64("fir2", S_IRUSR, dir, _cxl_p1_addr(adapter, CXL_PSL_FIR2)); @@ -9
Re: [PATCH V3 4/7] cxl: Update implementation service layer
On 28/03/2017 at 17:14, Christophe Lombard wrote: The service layer API (in cxl.h) lists some low-level functions whose implementation is different on PSL8, PSL9 and XSL: - Init implementation for the adapter and the afu. - Invalidate TLB/SLB. - Attach process for dedicated/directed models. - Handle psl interrupts. - Debug registers for the adapter and the afu. - Traces. Each environment implements its own functions, and the common code uses them through function pointers, defined in cxl_service_layer_ops. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/cxl.h | 40 --- drivers/misc/cxl/debugfs.c | 16 +++--- drivers/misc/cxl/guest.c | 2 +- drivers/misc/cxl/irq.c | 2 +- drivers/misc/cxl/native.c | 52 -- drivers/misc/cxl/pci.c | 47 - 6 files changed, 102 insertions(+), 57 deletions(-) diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index d8282e4..65d9fb6 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -553,13 +553,23 @@ struct cxl_context { struct mm_struct *mm; }; +struct cxl_irq_info; + struct cxl_service_layer_ops { int (*adapter_regs_init)(struct cxl *adapter, struct pci_dev *dev); + int (*invalidate_all)(struct cxl *adapter); int (*afu_regs_init)(struct cxl_afu *afu); + int (*sanitise_afu_regs)(struct cxl_afu *afu); int (*register_serr_irq)(struct cxl_afu *afu); void (*release_serr_irq)(struct cxl_afu *afu); - void (*debugfs_add_adapter_sl_regs)(struct cxl *adapter, struct dentry *dir); - void (*debugfs_add_afu_sl_regs)(struct cxl_afu *afu, struct dentry *dir); + irqreturn_t (*handle_interrupt)(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info); + irqreturn_t (*fail_irq)(struct cxl_afu *afu, struct cxl_irq_info *irq_info); + int (*activate_dedicated_process)(struct cxl_afu *afu); + int (*attach_afu_directed)(struct cxl_context *ctx, u64 wed, u64 amr); + int (*attach_dedicated_process)(struct cxl_context *ctx, u64 wed, u64 amr); + void (*update_dedicated_ivtes)(struct cxl_context *ctx); + void
(*debugfs_add_adapter_regs)(struct cxl *adapter, struct dentry *dir); + void (*debugfs_add_afu_regs)(struct cxl_afu *afu, struct dentry *dir); Some of those callbacks should be defined for the xsl_ops structure (Mellanox Cx4). We don't really support CX4 any more, we'll probably remove the code once CX5 is upstream, but for the time being, we should keep it around. void (*psl_irq_dump_registers)(struct cxl_context *ctx); void (*err_irq_dump_registers)(struct cxl *adapter); void (*debugfs_stop_trace)(struct cxl *adapter); @@ -803,6 +813,11 @@ int afu_register_irqs(struct cxl_context *ctx, u32 count); void afu_release_irqs(struct cxl_context *ctx, void *cookie); void afu_irq_name_free(struct cxl_context *ctx); +int cxl_attach_afu_directed_psl(struct cxl_context *ctx, u64 wed, u64 amr); +int cxl_activate_dedicated_process_psl(struct cxl_afu *afu); +int cxl_attach_dedicated_process_psl(struct cxl_context *ctx, u64 wed, u64 amr); +void cxl_update_dedicated_ivtes_psl(struct cxl_context *ctx); + #ifdef CONFIG_DEBUG_FS int cxl_debugfs_init(void); @@ -811,10 +826,10 @@ int cxl_debugfs_adapter_add(struct cxl *adapter); void cxl_debugfs_adapter_remove(struct cxl *adapter); int cxl_debugfs_afu_add(struct cxl_afu *afu); void cxl_debugfs_afu_remove(struct cxl_afu *afu); -void cxl_stop_trace(struct cxl *cxl); -void cxl_debugfs_add_adapter_psl_regs(struct cxl *adapter, struct dentry *dir); -void cxl_debugfs_add_adapter_xsl_regs(struct cxl *adapter, struct dentry *dir); -void cxl_debugfs_add_afu_psl_regs(struct cxl_afu *afu, struct dentry *dir); +void cxl_stop_trace_psl(struct cxl *cxl); +void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir); +void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir); +void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir); #else /* CONFIG_DEBUG_FS */ @@ -849,17 +864,17 @@ static inline void cxl_stop_trace(struct cxl *cxl) { } -static inline void 
cxl_debugfs_add_adapter_psl_regs(struct cxl *adapter, +static inline void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir) { } -static inline void cxl_debugfs_add_adapter_xsl_regs(struct cxl *adapter, +static inline void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir) { } -static inline void cxl_debugfs_add_afu_psl_regs(struct cxl_afu *afu, struct dentry *dir) +static inline void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir) { } @@ -917,19 +932,20 @@ struct cxl_irq_info { }; void cxl_assign_psn_space(struct cxl_context *ctx); -
Re: [PATCH V3 3/7] cxl: Keep track of mm struct associated with a context
On 28/03/2017 at 17:14, Christophe Lombard wrote: The mm_struct corresponding to the current task is acquired each time an interrupt is raised. So to simplify the code, we only get the mm_struct when attaching an AFU context to the process. The mm_count reference is increased to ensure that the mm_struct can't be freed. The mm_struct will be released when the context is detached. A reference on mm_users is not kept to avoid a circular dependency if the process mmaps its cxl mmio and forgets to unmap before exiting. Commit msg should mention that we also remove glpid because it's no longer useful now that we track the mm_struct. Another detail in cxl.h below Signed-off-by: Christophe Lombard --- drivers/misc/cxl/api.c | 17 +-- drivers/misc/cxl/context.c | 21 +++-- drivers/misc/cxl/cxl.h | 13 ++-- drivers/misc/cxl/fault.c | 76 -- drivers/misc/cxl/file.c| 15 +++-- drivers/misc/cxl/main.c| 12 ++-- 6 files changed, 64 insertions(+), 90 deletions(-) diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c index bcc030e..1a138c8 100644 --- a/drivers/misc/cxl/api.c +++ b/drivers/misc/cxl/api.c @@ -14,6 +14,7 @@ #include #include #include +#include #include "cxl.h" @@ -321,19 +322,29 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed, if (task) { ctx->pid = get_task_pid(task, PIDTYPE_PID); - ctx->glpid = get_task_pid(task->group_leader, PIDTYPE_PID); kernel = false; ctx->real_mode = false; + + /* acquire a reference to the task's mm */ + ctx->mm = get_task_mm(current); + + /* ensure this mm_struct can't be freed */ + cxl_context_mm_count_get(ctx); + + /* decrement the use count */ + if (ctx->mm) + mmput(ctx->mm); } cxl_ctx_get(); if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) { - put_pid(ctx->glpid); put_pid(ctx->pid); - ctx->glpid = ctx->pid = NULL; + ctx->pid = NULL; cxl_adapter_context_put(ctx->afu->adapter); cxl_ctx_put(); + if (task) + cxl_context_mm_count_put(ctx); goto out; } diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 062bf6c..2e935ea 100644 --- a/drivers/misc/cxl/context.c +++ b/drivers/misc/cxl/context.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -41,7 +42,7 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu *afu, bool master) spin_lock_init(&ctx->sste_lock); ctx->afu = afu; ctx->master = master; - ctx->pid = ctx->glpid = NULL; /* Set in start work ioctl */ + ctx->pid = NULL; /* Set in start work ioctl */ mutex_init(&ctx->mapping_lock); ctx->mapping = NULL; @@ -242,12 +243,16 @@ int __detach_context(struct cxl_context *ctx) /* release the reference to the group leader and mm handling pid */ put_pid(ctx->pid); - put_pid(ctx->glpid); cxl_ctx_put(); /* Decrease the attached context count on the adapter */ cxl_adapter_context_put(ctx->afu->adapter); + + /* Decrease the mm count on the context */ + cxl_context_mm_count_put(ctx); + ctx->mm = NULL; + return 0; } @@ -325,3 +330,15 @@ void cxl_context_free(struct cxl_context *ctx) mutex_unlock(&ctx->afu->contexts_lock); call_rcu(&ctx->rcu, reclaim_ctx); } + +void cxl_context_mm_count_get(struct cxl_context *ctx) +{ + if (ctx->mm) + atomic_inc(&ctx->mm->mm_count); +} + +void cxl_context_mm_count_put(struct cxl_context *ctx) +{ + if (ctx->mm) + mmdrop(ctx->mm); +} diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 2bbe077..d8282e4 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -482,8 +482,6 @@ struct cxl_context { unsigned int sst_size, sst_lru; wait_queue_head_t wq; - /* pid of the group leader associated with the pid */ - struct pid *glpid; /* use mm context associated with this pid for ds faults */ struct pid *pid; spinlock_t lock; /* Protects pending_irq_mask, pending_fault and fault_addr */ @@ -551,6 +549,8 @@ struct cxl_context { * CX4 only: */ struct list_head extra_irq_contexts; + + struct mm_struct *mm; }; struct cxl_service_layer_ops { @@ -1025,4 +1025,13 @@ int cxl_adapter_context_lock(struct cxl *adapter); /* Unlock the 
contexts-lock if taken. Warn and force unlock otherwise */ void cxl_adapter_context_unlock(struct cxl *adapter); +/* Increases the reference count to "struct mm_struct" */ +void cxl_context_mm_count_get(struct cxl_context *ctx); + +/* Decreme
Re: [PATCH V3 2/7] cxl: Remove unused values in bare-metal environment.
On 28/03/2017 at 17:14, Christophe Lombard wrote: The two fields pid and tid of the structure cxl_irq_info are only used in the guest environment. To avoid confusion, it's not necessary to fill the fields in the bare-metal environment. These two fields are renamed to 'reserved' to avoid undefined behavior on bare-metal. The PSL Process and Thread Identification Register (CXL_PSL_PID_TID_An) is only used when attaching a dedicated process for PSL8 only. This register goes away in CAIA2. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/cxl.h| 13 +++-- drivers/misc/cxl/hcalls.c | 4 ++-- drivers/misc/cxl/native.c | 5 - 3 files changed, 9 insertions(+), 13 deletions(-) diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 79e60ec..2bbe077 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -895,19 +895,20 @@ int __detach_context(struct cxl_context *ctx); * plpar_hcall9() in hvCall.S * As a consequence: * - we don't need to do any endianness conversion - * - the pid and tid are an exception. They are 32-bit values returned in - * the same 64-bit register. So we do need to worry about byte ordering. + * - the pid (reserved0) and tid (reserved1) are an exception. They are + * 32-bit values returned in the same 64-bit register. So we do need + * to worry about byte ordering. */ struct cxl_irq_info { u64 dsisr; u64 dar; u64 dsr; #ifndef CONFIG_CPU_LITTLE_ENDIAN - u32 pid; - u32 tid; + u32 reserved0; + u32 reserved1; #else - u32 tid; - u32 pid; + u32 reserved1; + u32 reserved0; #endif The resulting structure looks weird/over-complicated. I think we could get rid of the #ifdef about endianness by just declaring: struct cxl_irq_info { u64 dsisr; u64 dar; u64 dsr; u64 pid_tid; u64 afu_err; ... and just print the 64-bit value in hcalls.c and let the reader figure out the order. It's just informational anyway.
Fred u64 afu_err; u64 errstat; diff --git a/drivers/misc/cxl/hcalls.c b/drivers/misc/cxl/hcalls.c index d6d11f4..142a095 100644 --- a/drivers/misc/cxl/hcalls.c +++ b/drivers/misc/cxl/hcalls.c @@ -414,8 +414,8 @@ long cxl_h_collect_int_info(u64 unit_address, u64 process_token, switch (rc) { case H_SUCCESS: /* The interrupt info is returned in return registers. */ pr_devel("dsisr:%#llx, dar:%#llx, dsr:%#llx, pid:%u, tid:%u, afu_err:%#llx, errstat:%#llx\n", - info->dsisr, info->dar, info->dsr, info->pid, - info->tid, info->afu_err, info->errstat); + info->dsisr, info->dar, info->dsr, info->reserved0, + info->reserved1, info->afu_err, info->errstat); return 0; case H_PARAMETER: /* An incorrect parameter was supplied. */ return -EINVAL; diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c index 7ae7105..7257e8b 100644 --- a/drivers/misc/cxl/native.c +++ b/drivers/misc/cxl/native.c @@ -859,8 +859,6 @@ static int native_detach_process(struct cxl_context *ctx) static int native_get_irq_info(struct cxl_afu *afu, struct cxl_irq_info *info) { - u64 pidtid; - /* If the adapter has gone away, we can't get any meaningful * information. */ @@ -870,9 +868,6 @@ static int native_get_irq_info(struct cxl_afu *afu, struct cxl_irq_info *info) info->dsisr = cxl_p2n_read(afu, CXL_PSL_DSISR_An); info->dar = cxl_p2n_read(afu, CXL_PSL_DAR_An); info->dsr = cxl_p2n_read(afu, CXL_PSL_DSR_An); - pidtid = cxl_p2n_read(afu, CXL_PSL_PID_TID_An); - info->pid = pidtid >> 32; - info->tid = pidtid & 0xffffffff; info->afu_err = cxl_p2n_read(afu, CXL_AFU_ERR_An); info->errstat = cxl_p2n_read(afu, CXL_PSL_ErrStat_An); info->proc_handle = 0;
Re: [PATCH V3 1/7] cxl: Read vsec perst load image
On 28/03/2017 at 17:14, Christophe Lombard wrote: This bit is used to cause a flash image load for a programmable CAIA-compliant implementation. If this bit is set to '0', a power cycle of the adapter is required to load a programmable CAIA-compliant implementation from flash. This field will be used by the following patches. Signed-off-by: Christophe Lombard --- Acked-by: Frederic Barrat drivers/misc/cxl/pci.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c index 91f6459..e82a207 100644 --- a/drivers/misc/cxl/pci.c +++ b/drivers/misc/cxl/pci.c @@ -1332,6 +1332,7 @@ static int cxl_read_vsec(struct cxl *adapter, struct pci_dev *dev) CXL_READ_VSEC_IMAGE_STATE(dev, vsec, &image_state); adapter->user_image_loaded = !!(image_state & CXL_VSEC_USER_IMAGE_LOADED); adapter->perst_select_user = !!(image_state & CXL_VSEC_USER_IMAGE_LOADED); + adapter->perst_loads_image = !!(image_state & CXL_VSEC_PERST_LOADS_IMAGE); CXL_READ_VSEC_NAFUS(dev, vsec, &adapter->slices); CXL_READ_VSEC_AFU_DESC_OFF(dev, vsec, &afu_desc_off);
Re: [PATCH] raid6/altivec: adding vpermxor implementation for raid6 Q syndrome
Hi Matt, > The raid6 Q syndrome check has been optimised using the vpermxor > instruction. This instruction was made available with POWER8, ISA version > 2.07. It allows for both vperm and vxor instructions to be done in a single > instruction. This has been tested for correctness on a ppc64le vm with a > basic RAID6 setup containing 5 drives. > > The performance benchmarks are from the raid6test in the /lib/raid6/test > directory. These results are from an IBM Firestone machine with ppc64le > architecture. The benchmark results show a 35% speed increase over the best > existing algorithm for powerpc (altivec). The raid6test has also been run > on a big-endian ppc64 vm to ensure it also works for big-endian > architectures. > > Performance benchmarks: > > raid6: altivecx4 gen() 18773 MB/s > raid6: altivecx8 gen() 19438 MB/s > > raid6: vpermxor4 gen() 25112 MB/s > raid6: vpermxor8 gen() 26279 MB/s > Well done! > Note: Also fixed a small bug in pq.h regarding a missing and mismatched > ifdef statement So this could be slightly better explained: 4f8c55c5ad49 ("lib/raid6: build proper files on corresponding arch") breaks raid6test on ppc by rearranging the ifdefs incorrectly. You could possibly split that fix out into its own separate commit, but that is probably overkill. 
You should, however, probably include just before your Signed-off-by: Fixes: 4f8c55c5ad49 ("lib/raid6: build proper files on corresponding arch") > Signed-off-by: Matt Brown > --- > include/linux/raid/pq.h | 4 ++ > lib/raid6/Makefile | 27 - > lib/raid6/algos.c | 4 ++ > lib/raid6/altivec.uc| 3 ++ > lib/raid6/test/Makefile | 28 - > lib/raid6/vpermxor.uc | 102 > > 6 files changed, 165 insertions(+), 3 deletions(-) > create mode 100644 lib/raid6/vpermxor.uc > > diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h > index 4d57bba..3df9aa6 100644 > --- a/include/linux/raid/pq.h > +++ b/include/linux/raid/pq.h > @@ -107,6 +107,10 @@ extern const struct raid6_calls raid6_avx512x2; > extern const struct raid6_calls raid6_avx512x4; > extern const struct raid6_calls raid6_tilegx8; > extern const struct raid6_calls raid6_s390vx8; > +extern const struct raid6_calls raid6_vpermxor1; > +extern const struct raid6_calls raid6_vpermxor2; > +extern const struct raid6_calls raid6_vpermxor4; > +extern const struct raid6_calls raid6_vpermxor8; > > struct raid6_recov_calls { > void (*data2)(int, size_t, int, int, void **); > diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile > index 3057011..7775aad 100644 > --- a/lib/raid6/Makefile > +++ b/lib/raid6/Makefile > @@ -4,7 +4,8 @@ raid6_pq-y+= algos.o recov.o tables.o int1.o int2.o > int4.o \ > int8.o int16.o int32.o > > raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o > avx2.o avx512.o recov_avx512.o > -raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o > +raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \ > + vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o > raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o > raid6_pq-$(CONFIG_TILEGX) += tilegx8.o > raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o > @@ -88,6 +89,30 @@ $(obj)/altivec8.c: UNROLL := 8 > $(obj)/altivec8.c: $(src)/altivec.uc $(src)/unroll.awk FORCE > $(call 
if_changed,unroll) > > +CFLAGS_vpermxor1.o += $(altivec_flags) > +targets += vpermxor1.c > +$(obj)/vpermxor1.c: UNROLL := 1 > +$(obj)/vpermxor1.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE > + $(call if_changed,unroll) > + > +CFLAGS_vpermxor2.o += $(altivec_flags) > +targets += vpermxor2.c > +$(obj)/vpermxor2.c: UNROLL := 2 > +$(obj)/vpermxor2.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE > + $(call if_changed,unroll) > + > +CFLAGS_vpermxor4.o += $(altivec_flags) > +targets += vpermxor4.c > +$(obj)/vpermxor4.c: UNROLL := 4 > +$(obj)/vpermxor4.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE > + $(call if_changed,unroll) > + > +CFLAGS_vpermxor8.o += $(altivec_flags) > +targets += vpermxor8.c > +$(obj)/vpermxor8.c: UNROLL := 8 > +$(obj)/vpermxor8.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE > + $(call if_changed,unroll) > + > CFLAGS_neon1.o += $(NEON_FLAGS) > targets += neon1.c > $(obj)/neon1.c: UNROLL := 1 > diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c > index 7857049..edd4f69 100644 > --- a/lib/raid6/algos.c > +++ b/lib/raid6/algos.c > @@ -74,6 +74,10 @@ const struct raid6_calls * const raid6_algos[] = { > &raid6_altivec2, > &raid6_altivec4, > &raid6_altivec8, > + &raid6_vpermxor1, > + &raid6_vpermxor2, > + &raid6_vpermxor4, > + &raid6_vpermxor8, > #endif > #if defined(CONFIG_TILEGX) > &raid6_tilegx8, > diff --git a/lib/raid6/altivec.uc b/lib/raid6/altivec.uc > index 682aae8..d20ed0d 100644 > --- a/lib/r
[PATCH] KVM: PPC: Book3S PR: Do not always inject facility unavailable exceptions
KVM should not inject a facility unavailable exception into the guest when it tries to execute a mtspr/mfspr instruction for an SPR that is unavailable, and the vCPU is *not* running in PRoblem state. It's right that we inject an exception when the vCPU is in PR mode, since chapter "6.2.10 Facility Status and Control Register" of the PowerISA v2.07 says that "When the FSCR makes a facility unavailable, attempted usage of the facility in *problem state* is treated as follows: [...] Access of an SPR using mfspr/mtspr causes a Facility Unavailable interrupt". But if the guest vCPU is not in PR mode, we should follow the behavior that is described in chapter "4.4.4 Move To/From System Register Instructions" instead and treat the instruction as a NOP. Signed-off-by: Thomas Huth --- arch/powerpc/kvm/book3s_pr.c | 28 +++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index d4dfc0c..3e6c0b3 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -830,6 +831,31 @@ static void kvmppc_emulate_fac(struct kvm_vcpu *vcpu, ulong fac) } } +static void kvmppc_handle_fac_not_enabled(struct kvm_vcpu *vcpu, ulong fac) +{ + enum emulation_result er; + u32 inst; + + if (kvmppc_get_msr(vcpu) & MSR_PR) { + kvmppc_trigger_fac_interrupt(vcpu, fac); + return; + } + + er = kvmppc_get_last_inst(vcpu, INST_GENERIC, &inst); + if (er != EMULATE_DONE) + return; + + if (get_op(inst) == 31 && (get_xop(inst) == OP_31_XOP_MTSPR || + get_xop(inst) == OP_31_XOP_MFSPR)) { + /* mtspr and mfspr are treated as NOP for unsupported SPRs */ + kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4); + pr_debug_ratelimited("%s: write/read of disabled SPR: %d\n", +__func__, get_sprn(inst)); + } else { + kvmppc_trigger_fac_interrupt(vcpu, fac); + } +} + /* Enable facilities (TAR, EBB, DSCR) for the guest */ static int kvmppc_handle_fac(struct 
kvm_vcpu *vcpu, ulong fac) { @@ -855,7 +881,7 @@ static int kvmppc_handle_fac(struct kvm_vcpu *vcpu, ulong fac) if (!guest_fac_enabled) { /* Facility not enabled by the guest */ - kvmppc_trigger_fac_interrupt(vcpu, fac); + kvmppc_handle_fac_not_enabled(vcpu, fac); return RESUME_GUEST; } -- 1.8.3.1
[PATCH] KVM: PPC: Book3S PR: Do not fail emulation with mtspr/mfspr for unknown SPRs
According to the PowerISA 2.07, mtspr and mfspr should not generate an illegal instruction exception when being used with an undefined SPR, but rather treat the instruction as a NOP, inject a privilege exception or an emulation assistance exception - depending on the SPR number. Also turn the printk here into a ratelimited print statement, so that the guest cannot flood the dmesg log of the host by issuing lots of illegal mtspr/mfspr instructions here. Signed-off-by: Thomas Huth --- arch/powerpc/kvm/book3s.c | 1 + arch/powerpc/kvm/book3s_emulate.c | 30 ++ 2 files changed, 23 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index b6b5c18..9b007f9 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -137,6 +137,7 @@ void kvmppc_inject_interrupt(struct kvm_vcpu *vcpu, int vec, u64 flags) kvmppc_set_pc(vcpu, kvmppc_interrupt_offset(vcpu) + vec); vcpu->arch.mmu.reset_msr(vcpu); } +EXPORT_SYMBOL_GPL(kvmppc_inject_interrupt); static int kvmppc_book3s_vec2irqprio(unsigned int vec) { diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c index 8359752..9c31e23 100644 --- a/arch/powerpc/kvm/book3s_emulate.c +++ b/arch/powerpc/kvm/book3s_emulate.c @@ -503,10 +503,16 @@ int kvmppc_core_emulate_mtspr_pr(struct kvm_vcpu *vcpu, int sprn, ulong spr_val) break; unprivileged: default: - printk(KERN_INFO "KVM: invalid SPR write: %d\n", sprn); -#ifndef DEBUG_SPR - emulated = EMULATE_FAIL; -#endif + pr_info_ratelimited("KVM: invalid SPR write: %d\n", sprn); + if (sprn & 0x10) { + if (kvmppc_get_msr(vcpu) & MSR_PR) + kvmppc_core_queue_program(vcpu, SRR1_PROGPRIV); + } else { + if ((kvmppc_get_msr(vcpu) & MSR_PR) || sprn == 0) + kvmppc_inject_interrupt(vcpu, + BOOK3S_INTERRUPT_H_EMUL_ASSIST, + 0); + } break; } @@ -648,10 +654,18 @@ int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val break; default: unprivileged: - printk(KERN_INFO "KVM: invalid SPR read: %d\n",
sprn); -#ifndef DEBUG_SPR - emulated = EMULATE_FAIL; -#endif + pr_info_ratelimited("KVM: invalid SPR read: %d\n", sprn); + if (sprn & 0x10) { + if (kvmppc_get_msr(vcpu) & MSR_PR) + kvmppc_core_queue_program(vcpu, SRR1_PROGPRIV); + } else { + if ((kvmppc_get_msr(vcpu) & MSR_PR) || sprn == 0 || + sprn == 4 || sprn == 5 || sprn == 6) + kvmppc_inject_interrupt(vcpu, + BOOK3S_INTERRUPT_H_EMUL_ASSIST, + 0); + } + break; } -- 1.8.3.1
Re: [v6, 01/19] powerpc/mm/slice: Convert slice_mask high slice to a bitmap
On Thu, 2017-03-30 at 12:03:49 UTC, Michael Ellerman wrote: > From: "Aneesh Kumar K.V" > > In a followup patch we want to increase the va range, which will result > in us requiring high_slices to have more than 64 bits. To enable this, > convert high_slices to a bitmap. We keep the number of bits the same in this patch > and later change it to a higher value. > > Signed-off-by: Aneesh Kumar K.V > [mpe: Fold in fix to use bitmap_empty()] > Signed-off-by: Michael Ellerman Series applied to powerpc next. https://git.kernel.org/powerpc/c/f3207c124e7aa8d4d9cf32cc45b10c cheers
Re: [V3,01/10] powerpc/mm/nohash: MM_SLICE is only used by book3s 64
On Tue, 2017-03-21 at 17:29:51 UTC, "Aneesh Kumar K.V" wrote: > BOOKE code is dead code as per the Kconfig details. So make it simpler > by enabling MM_SLICE only for book3s_64. The changes w.r.t. nohash are just > removing dead code. W.r.t. ppc64, 4k without hugetlb will now enable MM_SLICE. > But that is good, because we reduce one extra variant which probably is not > getting tested much. > > Reviewed-by: Paul Mackerras > Signed-off-by: Aneesh Kumar K.V Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/b42279f0165cdfb093718ec368e56d cheers
Re: [v2, 1/2] powerpc/fadump: reserve memory at an offset closer to bottom of RAM
On Thu, 2017-03-16 at 21:05:26 UTC, Hari Bathini wrote: > Currently, the area to preserve boot memory is reserved at the top of > RAM. This leaves fadump vulnerable to memory hot-remove operations. As > memory for fadump has to be reserved early in the boot process, fadump > can't be registered after a memory hot-remove operation. Though this > problem can't be eliminated completely, the impact can be minimized by > reserving memory at an offset closer to the bottom of the RAM. The offset > for fadump memory reservation can be any value greater than the fadump boot > memory size. > > Signed-off-by: Hari Bathini Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/f6e6bedb773118713a8eb3736a4c99 cheers
Re: powerpc/4xx: make sam440ep_setup_rtc init
On Tue, 2016-04-26 at 16:49:38 UTC, yshi wrote: > sam440ep_setup_rtc is just called by machine_device_initcall, and it calls > i2c_register_board_info which is init too, so the lack of __init may cause > a section mismatch warning when linking the kernel. > > Signed-off-by: Yang Shi Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/adec9a2e7ca3b4193818f9a7b39563 cheers
[PATCH v2 3/3] powerpc/powernv: Introduce address translation services for Nvlink2
Nvlink2 supports address translation services (ATS) allowing devices to
request address translations from an MMU known as the nest MMU, which is
set up to walk the CPU page tables. To access this functionality certain
firmware calls are required to set up and manage hardware context tables
in the nvlink processing unit (NPU). The NPU also manages forwarding of
TLB invalidates (known as address translation shootdowns/ATSDs) to
attached devices.

This patch exports several methods to allow device drivers to register a
process id (PASID/PID) in the hardware tables and to receive notification
of when a device should stop issuing address translation requests (ATRs).
It also adds a fault handler to allow device drivers to fault pages in on
demand.

Signed-off-by: Alistair Popple
---
Changes since v1:
 - Moved exported function call prototypes to an externally accessible
   header.
 - Fixed invalidation of the NMMU PWC (this also fixes the reported
   build failure).

 arch/powerpc/include/asm/book3s/64/mmu.h       |   6 +
 arch/powerpc/include/asm/opal-api.h            |   5 +-
 arch/powerpc/include/asm/opal.h                |   5 +
 arch/powerpc/include/asm/powernv.h             |  22 ++
 arch/powerpc/mm/mmu_context_book3s64.c         |   1 +
 arch/powerpc/platforms/powernv/npu-dma.c       | 443 +
 arch/powerpc/platforms/powernv/opal-wrappers.S |   3 +
 arch/powerpc/platforms/powernv/pci-ioda.c      |   2 +
 arch/powerpc/platforms/powernv/pci.h           |  15 +-
 9 files changed, 500 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 805d4105..1676ec8 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -73,10 +73,16 @@ extern struct patb_entry *partition_tb;
 typedef unsigned long mm_context_id_t;
 struct spinlock;
 
+/* Maximum possible number of NPUs in a system. */
+#define NV_MAX_NPUS 8
+
 typedef struct {
 	mm_context_id_t id;
 	u16 user_psize;		/* page size index */
+	/* NPU NMMU context */
+	struct npu_context *npu_context;
+
 #ifdef CONFIG_PPC_MM_SLICES
 	u64 low_slices_psize;	/* SLB page size encodings */
 	unsigned char high_slices_psize[SLICE_ARRAY_SIZE];
diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index a0aa285..a599a2c 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -168,7 +168,10 @@
 #define OPAL_INT_SET_MFRR			125
 #define OPAL_PCI_TCE_KILL			126
 #define OPAL_NMMU_SET_PTCR			127
-#define OPAL_LAST				127
+#define OPAL_NPU_INIT_CONTEXT			146
+#define OPAL_NPU_DESTROY_CONTEXT		147
+#define OPAL_NPU_MAP_LPAR			148
+#define OPAL_LAST				148
 
 /* Device tree flags */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 1ff03a6..b3b97c4 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -29,6 +29,11 @@ extern struct device_node *opal_node;
 
 /* API functions */
 int64_t opal_invalid_call(void);
+int64_t opal_npu_destroy_context(uint64_t phb_id, uint64_t pid, uint64_t bdf);
+int64_t opal_npu_init_context(uint64_t phb_id, int pasid, uint64_t msr,
+			uint64_t bdf);
+int64_t opal_npu_map_lpar(uint64_t phb_id, uint64_t bdf, uint64_t lparid,
+			uint64_t lpcr);
 int64_t opal_console_write(int64_t term_number, __be64 *length,
 			   const uint8_t *buffer);
 int64_t opal_console_read(int64_t term_number, __be64 *length,
diff --git a/arch/powerpc/include/asm/powernv.h b/arch/powerpc/include/asm/powernv.h
index 0e9c240..f627977 100644
--- a/arch/powerpc/include/asm/powernv.h
+++ b/arch/powerpc/include/asm/powernv.h
@@ -11,9 +11,31 @@
 #define _ASM_POWERNV_H
 
 #ifdef CONFIG_PPC_POWERNV
+#define NPU2_WRITE 1
 extern void powernv_set_nmmu_ptcr(unsigned long ptcr);
+extern struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
+			unsigned long flags,
+			struct npu_context *(*cb)(struct npu_context *, void *),
+			void *priv);
+extern void pnv_npu2_destroy_context(struct npu_context *context,
+			struct pci_dev *gpdev);
+extern int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea,
+			unsigned long *flags, unsigned long *status,
+			int count);
 #else
 static inline void powernv_set_nmmu_ptcr(unsigned long ptcr) { }
+static inline struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
+			unsigned long flags,
+			struct npu_context *(*cb)(struct npu_context *, void *),
+			void *priv) { return ERR_PTR(-ENODEV);
[PATCH v2 2/3] powerpc/powernv: Add sanity checks to pnv_pci_get_{gpu|npu}_dev
The pnv_pci_get_{gpu|npu}_dev functions are used to find associations
between nvlink PCIe devices and standard PCIe devices. However they lacked
basic sanity checking, which results in a NULL pointer dereference if they
are incorrectly called; this can be harder to spot than an explicit
WARN_ON.

Signed-off-by: Alistair Popple
---
 arch/powerpc/platforms/powernv/npu-dma.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 1c383f3..050bd5d 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -37,6 +37,12 @@ struct pci_dev *pnv_pci_get_gpu_dev(struct pci_dev *npdev)
 	struct device_node *dn;
 	struct pci_dev *gpdev;
 
+	if (WARN_ON(!npdev))
+		return NULL;
+
+	if (WARN_ON(!npdev->dev.of_node))
+		return NULL;
+
 	/* Get assoicated PCI device */
 	dn = of_parse_phandle(npdev->dev.of_node, "ibm,gpu", 0);
 	if (!dn)
@@ -55,6 +61,12 @@ struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index)
 	struct device_node *dn;
 	struct pci_dev *npdev;
 
+	if (WARN_ON(!gpdev))
+		return NULL;
+
+	if (WARN_ON(!gpdev->dev.of_node))
+		return NULL;
+
 	/* Get assoicated PCI device */
 	dn = of_parse_phandle(gpdev->dev.of_node, "ibm,npu", index);
 	if (!dn)
-- 
2.1.4
[PATCH v2 1/3] drivers/of/base.c: Add of_property_read_u64_index
There is of_property_read_u32_index but no u64 variant. This patch adds
one similar to the u32 version for u64.

Signed-off-by: Alistair Popple
Acked-by: Rob Herring
---
 drivers/of/base.c  | 31 +++
 include/linux/of.h |  3 +++
 2 files changed, 34 insertions(+)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index d7c4629..0ea16bd 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -1213,6 +1213,37 @@ int of_property_read_u32_index(const struct device_node *np,
 EXPORT_SYMBOL_GPL(of_property_read_u32_index);
 
 /**
+ * of_property_read_u64_index - Find and read a u64 from a multi-value property.
+ *
+ * @np:		device node from which the property value is to be read.
+ * @propname:	name of the property to be searched.
+ * @index:	index of the u64 in the list of values
+ * @out_value:	pointer to return value, modified only if no error.
+ *
+ * Search for a property in a device node and read nth 64-bit value from
+ * it. Returns 0 on success, -EINVAL if the property does not exist,
+ * -ENODATA if property does not have a value, and -EOVERFLOW if the
+ * property data isn't large enough.
+ *
+ * The out_value is modified only if a valid u64 value can be decoded.
+ */
+int of_property_read_u64_index(const struct device_node *np,
+			       const char *propname,
+			       u32 index, u64 *out_value)
+{
+	const u64 *val = of_find_property_value_of_size(np, propname,
+					((index + 1) * sizeof(*out_value)),
+					0, NULL);
+
+	if (IS_ERR(val))
+		return PTR_ERR(val);
+
+	*out_value = be64_to_cpup(((__be64 *)val) + index);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(of_property_read_u64_index);
+
+/**
  * of_property_read_variable_u8_array - Find and read an array of u8 from a
  * property, with bounds on the minimum and maximum array size.
  *
diff --git a/include/linux/of.h b/include/linux/of.h
index 21e6323..d08788d 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -292,6 +292,9 @@ extern int of_property_count_elems_of_size(const struct device_node *np,
 extern int of_property_read_u32_index(const struct device_node *np,
 				      const char *propname,
 				      u32 index, u32 *out_value);
+extern int of_property_read_u64_index(const struct device_node *np,
+				      const char *propname,
+				      u32 index, u64 *out_value);
 extern int of_property_read_variable_u8_array(const struct device_node *np,
 					const char *propname, u8 *out_values,
 					size_t sz_min, size_t sz_max);
-- 
2.1.4
Re: [PowerPC][next-20170324][kselftest] kernel Oops when running tm/tm-signal-context-chk-vsx
On Mon, 2017-04-03 at 14:28 +0530, Abdul Haleem wrote:
> On Tue, 2017-03-28 at 21:00 +1100, Michael Ellerman wrote:
> > Abdul Haleem writes:
> >
> > > Hi,
> > >
> > > While running kernel self tests on ppc64, tm/tm-signal-context-chk-vsx
> > > tests fails with Oops message.
> > >
> > > I was able to reproduce only twice out of 20 runs on next-20170324 only.
> > > so it is difficult to bisect the commit causing the issue.
> >
> > Can you try mainline as of this commit:
> >
> >   605df8d674ac ("selftests/powerpc: Replace stxvx and lxvx with
> >   stxvd2x/lxvd2x")
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=605df8d674ac65e044a0bf4998b28c2f350b7f9e
> >
> > cheers
>
> Trace is not reproducible on mainline with above commit.
> Cyril was able to reproduce it and is working on it.

cc Cyril Bur

-- 
Regard's
Abdul Haleem
IBM Linux Technology Centre
Re: [PowerPC][next-20170324][kselftest] kernel Oops when running tm/tm-signal-context-chk-vsx
On Tue, 2017-03-28 at 21:00 +1100, Michael Ellerman wrote:
> Abdul Haleem writes:
>
> > Hi,
> >
> > While running kernel self tests on ppc64, tm/tm-signal-context-chk-vsx
> > tests fails with Oops message.
> >
> > I was able to reproduce only twice out of 20 runs on next-20170324 only.
> > so it is difficult to bisect the commit causing the issue.
>
> Can you try mainline as of this commit:
>
>   605df8d674ac ("selftests/powerpc: Replace stxvx and lxvx with stxvd2x/lxvd2x")
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=605df8d674ac65e044a0bf4998b28c2f350b7f9e
>
> cheers

Trace is not reproducible on mainline with above commit.
Cyril was able to reproduce it and is working on it.

-- 
Regard's
Abdul Haleem
IBM Linux Technology Centre
[PATCH] powerpc/mm: remove stale comment
The code to fix the problem it describes was removed in c40785a, and it
uses the stupid comment style. Away it goes!

Signed-off-by: Oliver O'Halloran
---
 arch/powerpc/mm/hash_utils_64.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 8848fec..69a05b3 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -927,11 +927,6 @@ static void __init htab_initialize(void)
 	}
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
-	/* On U3 based machines, we need to reserve the DART area and
-	 * _NOT_ map it to avoid cache paradoxes as it's remapped non
-	 * cacheable later on
-	 */
-
 	/* create bolted the linear mapping in the hash table */
 	for_each_memblock(memory, reg) {
 		base = (unsigned long)__va(reg->base);
-- 
2.9.3