Re: [PATCH] powerpc/perf_event: Fix oops due to perf_event_do_pending call
On Wed, 2010-04-14 at 16:46 +1000, Paul Mackerras wrote:
> Ben, please put this in your tree of fixes to go to Linus for 2.6.34,
> since it fixes a potential panic.

Should it go to -stable as well? How far back?

Cheers,
Ben.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc/perf_event: Fix oops due to perf_event_do_pending call
Anton Blanchard found that large POWER systems would occasionally crash in the exception exit path when profiling with perf_events. The symptom was that an interrupt would occur late in the exit path when the MSR[RI] (recoverable interrupt) bit was clear. Interrupts should be hard-disabled at this point but they were enabled. Because the interrupt was not recoverable the system panicked.

The reason is that the exception exit path was calling perf_event_do_pending after hard-disabling interrupts, and perf_event_do_pending will re-enable interrupts.

The simplest and cleanest fix for this is to use the same mechanism that 32-bit powerpc does, namely to cause a self-IPI by setting the decrementer to 1. This means we can remove the tests in the exception exit path and raw_local_irq_restore.

This also makes sure that the call to perf_event_do_pending from timer_interrupt() happens within irq_enter/irq_exit. (Note that calling perf_event_do_pending from timer_interrupt does not mean that there is a possible 1/HZ latency; setting the decrementer to 1 ensures that the timer interrupt will happen immediately, i.e. within one timebase tick, which is a few nanoseconds or 10s of nanoseconds.)

Signed-off-by: Paul Mackerras
Cc: sta...@kernel.org
---

Ben, please put this in your tree of fixes to go to Linus for 2.6.34, since it fixes a potential panic.
 arch/powerpc/include/asm/hw_irq.h |   38 ---
 arch/powerpc/kernel/asm-offsets.c |    1 
 arch/powerpc/kernel/entry_64.S    |    9 -
 arch/powerpc/kernel/irq.c         |    6 ---
 arch/powerpc/kernel/time.c        |   60 ++
 5 files changed, 48 insertions(+), 66 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
index 9f4c9d4..bd100fc 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -130,43 +130,5 @@ static inline int irqs_disabled_flags(unsigned long flags)
  */
 struct irq_chip;
 
-#ifdef CONFIG_PERF_EVENTS
-
-#ifdef CONFIG_PPC64
-static inline unsigned long test_perf_event_pending(void)
-{
-	unsigned long x;
-
-	asm volatile("lbz %0,%1(13)"
-		: "=r" (x)
-		: "i" (offsetof(struct paca_struct, perf_event_pending)));
-	return x;
-}
-
-static inline void set_perf_event_pending(void)
-{
-	asm volatile("stb %0,%1(13)" : :
-		"r" (1),
-		"i" (offsetof(struct paca_struct, perf_event_pending)));
-}
-
-static inline void clear_perf_event_pending(void)
-{
-	asm volatile("stb %0,%1(13)" : :
-		"r" (0),
-		"i" (offsetof(struct paca_struct, perf_event_pending)));
-}
-#endif /* CONFIG_PPC64 */
-
-#else /* CONFIG_PERF_EVENTS */
-
-static inline unsigned long test_perf_event_pending(void)
-{
-	return 0;
-}
-
-static inline void clear_perf_event_pending(void) {}
-#endif /* CONFIG_PERF_EVENTS */
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_HW_IRQ_H */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 957ceb7..c09138d 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -133,7 +133,6 @@ int main(void)
 	DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr));
 	DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled));
 	DEFINE(PACAHARDIRQEN, offsetof(struct paca_struct, hard_enabled));
-	DEFINE(PACAPERFPEND, offsetof(struct paca_struct, perf_event_pending));
 	DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
 #ifdef CONFIG_PPC_MM_SLICES
 	DEFINE(PACALOWSLICESPSIZE, offsetof(struct paca_struct,
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 07109d8..42e9d90 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -556,15 +556,6 @@ ALT_FW_FTR_SECTION_END_IFCLR(FW_FEATURE_ISERIES)
 2:
 	TRACE_AND_RESTORE_IRQ(r5);
 
-#ifdef CONFIG_PERF_EVENTS
-	/* check paca->perf_event_pending if we're enabling ints */
-	lbz	r3,PACAPERFPEND(r13)
-	and.	r3,r3,r5
-	beq	27f
-	bl	.perf_event_do_pending
-27:
-#endif /* CONFIG_PERF_EVENTS */
-
 	/* extract EE bit and use it to restore paca->hard_enabled */
 	ld	r3,_MSR(r1)
 	rldicl	r4,r3,49,63		/* r0 = (r3 >> 15) & 1 */
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 64f6f20..066bd31 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -53,7 +53,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
@@ -145,11 +144,6 @@ notrace void raw_local_irq_restore(unsigned long en)
 	}
 #endif /* CONFIG_PPC_STD_MMU_64 */
 
-	if (test_perf_event_pending()) {
-		clear_perf_event_pending();
-		perf_event_do_pending();
-	}
-
 	/*
 	 * if (get_paca()->hard_enabled) return;
 	 * But again we need t
Re: [PATCH 2/5] sched: add asymmetric packing option for sibling domain
In message <1271161767.4807.1281.ca...@twins> you wrote:
> On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > Peter: Since this is based mainly off your initial patch, it should
> > have your signed-off-by too, but I didn't want to add it without your
> > permission. Can I add it?
>
> Of course! :-)
>
> This thing does need a better changelog though, and maybe a larger
> comment with check_asym_packing(), explaining why and what we're doing
> and what we're assuming (that lower cpu number also means lower thread
> number).

OK, updated patch below...

Mikey

[PATCH 2/5] sched: add asymmetric group packing option for sibling domain

Check to see if the group is packed in a sched domain. This is primarily intended to be used at the sibling level. Some cores, like POWER7, prefer to use lower-numbered SMT threads. In the case of POWER7, it can move to lower SMT modes only when the higher threads are idle. When in lower SMT modes, the threads perform better since they share fewer core resources. Hence when we have idle threads, we want them to be the higher ones.

This adds a hook into f_b_g() called check_asym_packing() to check the packing. This packing function is run on idle threads. It checks whether the busiest CPU in this domain (the core in the POWER7 case) has a higher CPU number than the CPU where the packing function is being run. If it does, it calculates the imbalance and returns the higher, busier thread as the busiest group to f_b_g(). Here we are assuming that a lower CPU number is equivalent to a lower SMT thread number.

It also creates a new SD_ASYM_PACKING flag to enable this feature at any scheduler domain level, and an arch hook to enable it at the sibling level. The default function doesn't enable this feature.

Based heavily on a patch from Peter Zijlstra.
Signed-off-by: Michael Neuling
Signed-off-by: Peter Zijlstra
---
 include/linux/sched.h    |    4 +-
 include/linux/topology.h |    1 
 kernel/sched_fair.c      |   93 +-
 3 files changed, 94 insertions(+), 4 deletions(-)

Index: linux-2.6-ozlabs/include/linux/sched.h
===
--- linux-2.6-ozlabs.orig/include/linux/sched.h
+++ linux-2.6-ozlabs/include/linux/sched.h
@@ -799,7 +799,7 @@ enum cpu_idle_type {
 #define SD_POWERSAVINGS_BALANCE	0x0100	/* Balance for power savings */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
-
+#define SD_ASYM_PACKING		0x0800	/* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 
 enum powersavings_balance_level {
@@ -834,6 +834,8 @@ static inline int sd_balance_for_package
 	return SD_PREFER_SIBLING;
 }
 
+extern int __weak arch_sd_sibiling_asym_packing(void);
+
 /*
  * Optimise SD flags for power savings:
  * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings.
Index: linux-2.6-ozlabs/include/linux/topology.h
===
--- linux-2.6-ozlabs.orig/include/linux/topology.h
+++ linux-2.6-ozlabs/include/linux/topology.h
@@ -102,6 +102,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_SHARE_PKG_RESOURCES	\
 				| 0*SD_SERIALIZE		\
 				| 0*SD_PREFER_SIBLING		\
+				| arch_sd_sibiling_asym_packing()	\
 				,				\
 	.last_balance		= jiffies,			\
 	.balance_interval	= 1,				\
Index: linux-2.6-ozlabs/kernel/sched_fair.c
===
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2493,6 +2493,39 @@ static inline void update_sg_lb_stats(st
 }
 
 /**
+ * update_sd_pick_busiest - return 1 on busiest group
+ * @sd: sched_domain whose statistics are to be checked
+ * @sds: sched_domain statistics
+ * @sg: sched_group candidate to be checked for being the busiest
+ * @sgs: sched_group statistics
+ *
+ * This returns 1 for the busiest group.
+ * If asymmetric packing is
+ * enabled and we already have a busiest, but this candidate group has
+ * a higher cpu number than the current busiest, pick this sg.
+ */
+static int update_sd_pick_busiest(struct sched_domain *sd,
+				  struct sd_lb_stats *sds,
+				  struct sched_group *sg,
+				  struct sg_lb_stats *sgs)
+{
+	if (sgs->sum_nr_running > sgs->group_capacity)
+		return 1;
+
+	if (sgs->group_imb)
+		return 1;
[PATCH v2 2/2] perf probe: Add PowerPC DWARF register number mappings
From: Ian Munsie

This patch adds mappings from the register numbers from DWARF to the register names used in the PowerPC Regs and Stack Access API. This allows perf probe to be used to record variable contents on PowerPC.

This patch depends on functionality in the powerpc/next tree, though it will compile fine without it. Specifically this patch depends on commit "powerpc: Add kprobe-based event tracer".

Signed-off-by: Ian Munsie
---
Changes since v1: Removed arch specific arch_dwarf-regs.h and defined PERF_HAVE_DWARF_REGS in the arch specific Makefile.

 tools/perf/arch/powerpc/Makefile          |    2 +
 tools/perf/arch/powerpc/util/dwarf-regs.c |   88 +
 2 files changed, 90 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/arch/powerpc/Makefile
 create mode 100644 tools/perf/arch/powerpc/util/dwarf-regs.c

diff --git a/tools/perf/arch/powerpc/Makefile b/tools/perf/arch/powerpc/Makefile
new file mode 100644
index 000..cbd7833
--- /dev/null
+++ b/tools/perf/arch/powerpc/Makefile
@@ -0,0 +1,2 @@
+PERF_HAVE_DWARF_REGS := 1
+LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/dwarf-regs.o
diff --git a/tools/perf/arch/powerpc/util/dwarf-regs.c b/tools/perf/arch/powerpc/util/dwarf-regs.c
new file mode 100644
index 000..48ae0c5
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/dwarf-regs.c
@@ -0,0 +1,88 @@
+/*
+ * Mapping of DWARF debug register numbers into register names.
+ *
+ * Copyright (C) 2010 Ian Munsie, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include
+#include
+
+
+struct pt_regs_dwarfnum {
+	const char *name;
+	unsigned int dwarfnum;
+};
+
+#define STR(s)	#s
+#define REG_DWARFNUM_NAME(r, num) {.name = r, .dwarfnum = num}
+#define GPR_DWARFNUM_NAME(num)	\
+	{.name = STR(%gpr##num), .dwarfnum = num}
+#define REG_DWARFNUM_END {.name = NULL, .dwarfnum = 0}
+
+/*
+ * Reference:
+ * http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html
+ */
+static const struct pt_regs_dwarfnum regdwarfnum_table[] = {
+	GPR_DWARFNUM_NAME(0),
+	GPR_DWARFNUM_NAME(1),
+	GPR_DWARFNUM_NAME(2),
+	GPR_DWARFNUM_NAME(3),
+	GPR_DWARFNUM_NAME(4),
+	GPR_DWARFNUM_NAME(5),
+	GPR_DWARFNUM_NAME(6),
+	GPR_DWARFNUM_NAME(7),
+	GPR_DWARFNUM_NAME(8),
+	GPR_DWARFNUM_NAME(9),
+	GPR_DWARFNUM_NAME(10),
+	GPR_DWARFNUM_NAME(11),
+	GPR_DWARFNUM_NAME(12),
+	GPR_DWARFNUM_NAME(13),
+	GPR_DWARFNUM_NAME(14),
+	GPR_DWARFNUM_NAME(15),
+	GPR_DWARFNUM_NAME(16),
+	GPR_DWARFNUM_NAME(17),
+	GPR_DWARFNUM_NAME(18),
+	GPR_DWARFNUM_NAME(19),
+	GPR_DWARFNUM_NAME(20),
+	GPR_DWARFNUM_NAME(21),
+	GPR_DWARFNUM_NAME(22),
+	GPR_DWARFNUM_NAME(23),
+	GPR_DWARFNUM_NAME(24),
+	GPR_DWARFNUM_NAME(25),
+	GPR_DWARFNUM_NAME(26),
+	GPR_DWARFNUM_NAME(27),
+	GPR_DWARFNUM_NAME(28),
+	GPR_DWARFNUM_NAME(29),
+	GPR_DWARFNUM_NAME(30),
+	GPR_DWARFNUM_NAME(31),
+	REG_DWARFNUM_NAME("%msr",   66),
+	REG_DWARFNUM_NAME("%ctr",   109),
+	REG_DWARFNUM_NAME("%link",  108),
+	REG_DWARFNUM_NAME("%xer",   101),
+	REG_DWARFNUM_NAME("%dar",   119),
+	REG_DWARFNUM_NAME("%dsisr", 118),
+	REG_DWARFNUM_END,
+};
+
+/**
+ * get_arch_regstr() - lookup register name from its DWARF register number
+ * @n:	the DWARF register number
+ *
+ * get_arch_regstr() returns the name of the register in struct
+ * regdwarfnum_table from its DWARF register number.
+ * If the register is not
+ * found in the table, this returns NULL.
+ */
+const char *get_arch_regstr(unsigned int n)
+{
+	const struct pt_regs_dwarfnum *roff;
+	for (roff = regdwarfnum_table; roff->name != NULL; roff++)
+		if (roff->dwarfnum == n)
+			return roff->name;
+	return NULL;
+}
--
1.7.0
[PATCH v2 1/2] perf: Move arch specific code into separate arch directory
From: Ian Munsie

The perf userspace tool included some architecture specific code to map registers from the DWARF register number into the names used by the regs and stack access API. This patch moves the architecture specific code out into a separate arch/x86 directory along with the infrastructure required to use it.

Signed-off-by: Ian Munsie
---
Changes since v1: From Masami Hiramatsu's suggestion, I added a check in the Makefile for whether the arch specific Makefile defines PERF_HAVE_DWARF_REGS, printing a message during the build if it has not. This simplifies the code, removing the odd macro from the previous version and the need for an arch specific arch_dwarf-regs.h. I have not entirely disabled DWARF support for architectures that don't implement the register mappings, so that they can still add a probe based on a line number (they will just be missing the ability to capture the value of a variable from a register).

 tools/perf/Makefile                   |   27 ++-
 tools/perf/arch/x86/Makefile          |    2 +
 tools/perf/arch/x86/util/dwarf-regs.c |   75 +
 tools/perf/util/include/dwarf-regs.h  |    6 +++
 tools/perf/util/probe-finder.c        |   56 +++--
 5 files changed, 113 insertions(+), 53 deletions(-)
 create mode 100644 tools/perf/arch/x86/Makefile
 create mode 100644 tools/perf/arch/x86/util/dwarf-regs.c
 create mode 100644 tools/perf/util/include/dwarf-regs.h

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 57b3569..7f71daf 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -173,6 +173,20 @@ uname_R := $(shell sh -c 'uname -r 2>/dev/null || echo not')
 uname_P := $(shell sh -c 'uname -p 2>/dev/null || echo not')
 uname_V := $(shell sh -c 'uname -v 2>/dev/null || echo not')
 
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/ -e s/sun4u/sparc64/ \
+				  -e s/arm.*/arm/ -e s/sa110/arm/ \
+				  -e s/s390x/s390/ -e s/parisc64/parisc/ \
+				  -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
+				  -e s/sh[234].*/sh/ )
+
+# Additional ARCH settings for x86
+ifeq ($(ARCH),i386)
+ARCH := x86
+endif
+ifeq ($(ARCH),x86_64)
+ARCH := x86
+endif
+
 # CFLAGS and LDFLAGS are for the users to override from the command line.
 #
@@ -285,7 +299,7 @@ endif
 # Those must not be GNU-specific; they are shared with perl/ which may
 # be built by a different compiler. (Note that this is an artifact now
 # but it still might be nice to keep that distinction.)
-BASIC_CFLAGS = -Iutil/include
+BASIC_CFLAGS = -Iutil/include -Iarch/$(ARCH)/include
 BASIC_LDFLAGS =
 
 # Guard against environment variables
@@ -367,6 +381,7 @@ LIB_H += util/include/asm/byteorder.h
 LIB_H += util/include/asm/swab.h
 LIB_H += util/include/asm/system.h
 LIB_H += util/include/asm/uaccess.h
+LIB_H += util/include/dwarf-regs.h
 LIB_H += perf.h
 LIB_H += util/cache.h
 LIB_H += util/callchain.h
@@ -485,6 +500,7 @@ PERFLIBS = $(LIB_FILE)
 -include config.mak.autogen
 -include config.mak
+-include arch/$(ARCH)/Makefile
 
 ifeq ($(uname_S),Darwin)
 	ifndef NO_FINK
@@ -525,8 +541,13 @@ ifndef NO_DWARF
 	BASIC_CFLAGS += -I/usr/include/elfutils -DDWARF_SUPPORT
 	EXTLIBS += -lelf -ldw
 	LIB_OBJS += $(OUTPUT)util/probe-finder.o
-endif
-endif
+ifneq ($(origin PERF_HAVE_DWARF_REGS), undefined)
+	BASIC_CFLAGS += -I/usr/include/elfutils -DPERF_HAVE_DWARF_REGS
+else
+	msg := $(warning DWARF register mappings have not been defined for architecture $(ARCH), DWARF support will be limited);
+endif # PERF_HAVE_DWARF_REGS
+endif # NO_DWARF
+endif # Dwarf support
 
 ifneq ($(shell sh -c "(echo '\#include '; echo 'int main(void) { newtInit(); newtCls(); return newtFinished(); }') | $(CC) -x c - $(ALL_CFLAGS) -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -lnewt -o $(BITBUCKET) $(ALL_LDFLAGS) $(EXTLIBS) "$(QUIET_STDERR)" && echo y"), y)
 	msg := $(warning newt not found, disables TUI support.
Please install newt-devel or libnewt-dev);
diff --git a/tools/perf/arch/x86/Makefile b/tools/perf/arch/x86/Makefile
new file mode 100644
index 000..cbd7833
--- /dev/null
+++ b/tools/perf/arch/x86/Makefile
@@ -0,0 +1,2 @@
+PERF_HAVE_DWARF_REGS := 1
+LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/dwarf-regs.c b/tools/perf/arch/x86/util/dwarf-regs.c
new file mode 100644
index 000..a794d30
--- /dev/null
+++ b/tools/perf/arch/x86/util/dwarf-regs.c
@@ -0,0 +1,75 @@
+/*
+ * dwarf-regs.c : Mapping of DWARF debug register numbers into register names.
+ * Extracted from probe-finder.c
+ *
+ * Written by Masami Hiramatsu
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This pr
perf: Split out arch specific code & improve PowerPC perf probe support
These patches add the required mappings to use perf probe on PowerPC. Functionality-wise they require the patch titled "powerpc: Add kprobe-based event tracer" from the powerpc-next tree to provide the HAVE_REGS_AND_STACK_ACCESS_API required for CONFIG_KPROBE_EVENT. The code will still compile cleanly without it and will fail gracefully at runtime on the missing CONFIG_KPROBE_EVENT support as before.

Part 1 of the patch series moves the arch dependent x86 32 and 64 bit DWARF register number mappings out into a separate arch directory and adds the necessary Makefile foo to use it. Part 2 of the patch series adds the PowerPC mappings.

Changes since v1: From Masami Hiramatsu's suggestion, I added a check in the Makefile for whether the arch specific Makefile defines PERF_HAVE_DWARF_REGS, printing a message during the build if it has not. This simplifies the code, removing the odd macro from the previous version and the need for an arch specific arch_dwarf-regs.h. I have not entirely disabled DWARF support for architectures that don't implement the register mappings, so that they can still add a probe based on a line number (they will just be missing the ability to capture the value of a variable from a register).

Thanks,
-Ian
Re: [PATCH 1/5] sched: fix capacity calculations for SMT4
In message <1271161766.4807.1280.ca...@twins> you wrote:
> On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > When calculating capacity we use the following calculation:
> >
> >	capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> >
> > In SMT2, power will be 1178/2 (provided we are not scaling power with
> > freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
> > being 1.
> >
> > With SMT4 however, power will be 1178/4, hence capacity will end up as
> > 0.
> >
> > Fix this by ensuring capacity is always at least 1 after this
> > calculation.
> >
> > Signed-off-by: Michael Neuling
> > ---
> > I'm not sure this is the correct fix but this works for me.
>
> Right, so I suspect this will indeed break some things.
>
> We initially allowed 0 capacity for when a cpu is consumed by an RT task
> and there simply isn't much capacity left, in that case you really want
> to try and move load to your sibling cpus if possible.

Changing the CPU power based on what tasks are running on them seems a bit wrong to me. Shouldn't we keep those concepts separate?

> However you're right that this goes awry in your case.
>
> One thing to look at is if that 15% increase is indeed representative
> for the power7 cpu, it having 4 SMT threads seems to suggest there was
> significant gains, otherwise they'd not have wasted the silicon.

There are certainly, for most workloads, per-core gains for SMT4 over SMT2 on P7. My kernels certainly compile faster and that's the only workload anyone who matters cares about ;-)

> (The broken x86 code was meant to actually compute the SMT gain, so that
> we'd not have to guess the 15%)
>
> Now, increasing this will only marginally fix the issue, since if you
> end up with 512 per thread it only takes a very tiny amount of RT
> workload to drop below and end up at 0 again.

I tried initially to make smt_gain programmable and at 2048 per core (512 per thread), the packing became unstable, as you intimated.
> One thing we could look at is using the cpu base power to compute
> capacity from. We'd have to add another field to sched_group and store
> power before we do the scale_rt_power() stuff.

Separating capacity from what RT tasks are running seems like a good idea to me. This would fix the RT issue, but it's not clear to me how you are suggesting we fix the rounding-down-to-0 SMT4 issue. Are you suggesting we bump smt_gain to say 2048 + 15%? Or are you suggesting we separate the RT tasks out from capacity and keep the max(1, capacity) that I've added? Or something else?

Would another possibility be making capacity a scaled value (like cpu_power is now) rather than a small integer as it is now? For example, a scaled capacity of 1024 would be equivalent to a capacity of 1 now. This might enable us to handle partial capacities better? We'd probably have to scale a bunch of nr_running too.

Mikey
Re: [Patch 1/1] PPC64-HWBKPT: Implement hw-breakpoints for PPC64
On Wed, Apr 07, 2010 at 06:03:31PM +1000, Benjamin Herrenschmidt wrote:
> Ok so too many problems with your last patch, I didn't have time to fix
> them all, so it's not going into -next this week.
>
> Please, test with a variety of defconfigs (iseries, cell, g5 for
> example), and especially with CONFIG_PERF_EVENTS not set. There are
> issues in the generic header for that (though I'm told some people are
> working on a fix).
>
> Basically, we need something like CONFIG_HW_BREAKPOINTS that is set
> (maybe optionally, maybe not) with CONFIG_PERF_EVENTS and
> CONFIG_HAVE_HW_BREAKPOINT. Then you can use that to properly fix the
> ifdef'ing in include/linux/hw_breakpoints.h, and fix the various other
> cases in powerpc code where you are testing for the wrong thing.
>
> Cheers,
> Ben.

Hi Ben,

I've sent a new patch (linuxppc-dev message-id ref: 20100414034340.ga6...@in.ibm.com) that builds against the defconfigs for the various architectures you pointed out (I did see quite a few errors that aren't induced by the patch). The source tree is buildable even without CONFIG_PERF_EVENTS, and the new code is limited in scope using CONFIG_HAVE_HW_BREAKPOINT. At this stage, I did not find a need for a separate CONFIG_HW_BREAKPOINTS though. Let me know what you think.

Thanks,
K.Prasad
[Patch 2/2] PPC64-HWBKPT: Implement hw-breakpoints for PPC64
Implement perf-events based hw-breakpoint interfaces for PPC64 processors. These interfaces help arbitrate requests from various users and schedule them as appropriate.

Signed-off-by: K.Prasad
---
 arch/powerpc/Kconfig                     |    1 
 arch/powerpc/include/asm/cputable.h      |    4 
 arch/powerpc/include/asm/hw_breakpoint.h |   45 
 arch/powerpc/include/asm/processor.h     |    8 
 arch/powerpc/kernel/Makefile             |    1 
 arch/powerpc/kernel/hw_breakpoint.c      |  324 +++
 arch/powerpc/kernel/machine_kexec_64.c   |    3 
 arch/powerpc/kernel/process.c            |    6 
 arch/powerpc/kernel/ptrace.c             |   81 +++
 arch/powerpc/lib/Makefile                |    1 
 include/linux/hw_breakpoint.h            |    1 
 11 files changed, 475 insertions(+)

Index: linux-2.6.ppc64_test/arch/powerpc/include/asm/hw_breakpoint.h
===
--- /dev/null
+++ linux-2.6.ppc64_test/arch/powerpc/include/asm/hw_breakpoint.h
@@ -0,0 +1,45 @@
+#ifndef _PPC_BOOK3S_64_HW_BREAKPOINT_H
+#define _PPC_BOOK3S_64_HW_BREAKPOINT_H
+
+#ifdef __KERNEL__
+#define __ARCH_HW_BREAKPOINT_H
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+
+struct arch_hw_breakpoint {
+	u8 len;		/* length of the target data symbol */
+	int type;
+	unsigned long address;
+};
+
+#include
+#include
+#include
+
+struct perf_event;
+struct pmu;
+struct perf_sample_data;
+
+#define HW_BREAKPOINT_ALIGN 0x7
+/* Maximum permissible length of any HW Breakpoint */
+#define HW_BREAKPOINT_LEN 0x8
+
+extern int arch_validate_hwbkpt_settings(struct perf_event *bp,
+					 struct task_struct *tsk);
+extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
+					   unsigned long val, void *data);
+int arch_install_hw_breakpoint(struct perf_event *bp);
+void arch_uninstall_hw_breakpoint(struct perf_event *bp);
+void hw_breakpoint_pmu_read(struct perf_event *bp);
+extern void flush_ptrace_hw_breakpoint(struct task_struct *tsk);
+
+extern struct pmu perf_ops_bp;
+extern void ptrace_triggered(struct perf_event *bp, int nmi,
+			     struct perf_sample_data *data, struct pt_regs *regs);
+static inline void hw_breakpoint_disable(void)
+{
+	set_dabr(0);
+}
+
+#endif /*
CONFIG_HAVE_HW_BREAKPOINT */
+#endif /* __KERNEL__ */
+#endif /* _PPC_BOOK3S_64_HW_BREAKPOINT_H */
Index: linux-2.6.ppc64_test/arch/powerpc/kernel/hw_breakpoint.c
===
--- /dev/null
+++ linux-2.6.ppc64_test/arch/powerpc/kernel/hw_breakpoint.c
@@ -0,0 +1,324 @@
+/*
+ * HW_breakpoint: a unified kernel/user-space hardware breakpoint facility,
+ * using the CPU's debug registers. Derived from
+ * "arch/x86/kernel/hw_breakpoint.c"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright 2009 IBM Corporation
+ * Author: K.Prasad
+ *
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+#include
+
+/*
+ * Store the 'bp' that caused the hw-breakpoint exception just before we
+ * single-step. Use it to distinguish a single-step exception (due to a
+ * previous hw-breakpoint exception) from a normal one
+ */
+static DEFINE_PER_CPU(struct perf_event *, last_hit_bp);
+
+/*
+ * Stores the breakpoints currently in use on each breakpoint address
+ * register for every cpu
+ */
+static DEFINE_PER_CPU(struct perf_event *, bp_per_reg);
+
+/*
+ * Install a perf counter breakpoint.
+ *
+ * We seek a free debug address register and use it for this
+ * breakpoint.
+ *
+ * Atomic: we hold the counter->ctx->lock and we only handle variables
+ * and registers local to this cpu.
+ */
+int arch_install_hw_breakpoint(struct perf_event *bp)
+{
+	struct arch_hw_breakpoint *info = counter_arch_bp(bp);
+	struct perf_event **slot = &__get_cpu_var(bp_per_reg);
+
+	*slot = bp;
+	set_dabr(info->address | info->type | DABR_TRANSLATION);
+	return 0;
+}
+
+/*
+ * Uninstall the breakpoint contained in the given counter.
+ *
+ * First we search the debug address register it uses and then we disable
+ * it.
+ *
+ * Atomic: we hold the
[Patch 1/2] PPC64-HWBKPT: Disable interrupts for data breakpoint exceptions
Data address breakpoint exceptions are currently handled along with page-faults, which require interrupts to remain enabled. Since exception handling for data breakpoints isn't pre-empt safe, we handle them separately.

Signed-off-by: K.Prasad
---
 arch/powerpc/kernel/exceptions-64s.S |   13 -
 arch/powerpc/mm/fault.c              |    5 +++--
 2 files changed, 15 insertions(+), 3 deletions(-)

Index: linux-2.6.ppc64_test/arch/powerpc/kernel/exceptions-64s.S
===
--- linux-2.6.ppc64_test.orig/arch/powerpc/kernel/exceptions-64s.S
+++ linux-2.6.ppc64_test/arch/powerpc/kernel/exceptions-64s.S
@@ -735,8 +735,11 @@ _STATIC(do_hash_page)
 	std	r3,_DAR(r1)
 	std	r4,_DSISR(r1)
 
-	andis.	r0,r4,0xa450		/* weird error? */
+	andis.	r0,r4,0xa410		/* weird error? */
 	bne-	handle_page_fault	/* if not, try to insert a HPTE */
+	andis.	r0,r4,DSISR_DABRMATCH@h
+	bne-	handle_dabr_fault
+
 BEGIN_FTR_SECTION
 	andis.	r0,r4,0x0020		/* Is it a segment table fault? */
 	bne-	do_ste_alloc		/* If so handle it */
@@ -823,6 +826,14 @@ END_FW_FTR_SECTION_IFCLR(FW_FEATURE_ISER
 	bl	.raw_local_irq_restore
 	b	11f
 
+/* We have a data breakpoint exception - handle it */
+handle_dabr_fault:
+	ld	r4,_DAR(r1)
+	ld	r5,_DSISR(r1)
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	bl	.do_dabr
+	b	.ret_from_except_lite
+
 /* Here we have a page fault that hash_page can't handle.
 */
handle_page_fault:
	ENABLE_INTS
Index: linux-2.6.ppc64_test/arch/powerpc/mm/fault.c
===
--- linux-2.6.ppc64_test.orig/arch/powerpc/mm/fault.c
+++ linux-2.6.ppc64_test/arch/powerpc/mm/fault.c
@@ -151,13 +151,14 @@ int __kprobes do_page_fault(struct pt_re
 	if (!user_mode(regs) && (address >= TASK_SIZE))
 		return SIGSEGV;
 
-#if !(defined(CONFIG_4xx) || defined(CONFIG_BOOKE))
+#if !(defined(CONFIG_4xx) || defined(CONFIG_BOOKE) || \
+    defined(CONFIG_PPC_BOOK3S_64))
 	if (error_code & DSISR_DABRMATCH) {
 		/* DABR match */
 		do_dabr(regs, address, error_code);
 		return 0;
 	}
-#endif /* !(CONFIG_4xx || CONFIG_BOOKE)*/
+#endif
 
 	if (in_atomic() || mm == NULL) {
 		if (!user_mode(regs))
[Patch 0/2] PPC64-HWBKPT: Hardware Breakpoint interfaces - ver XVII
Hi Ben,

Please find a new version of the patchset. It contains a small yet significant number of changes to address the build issues you pointed out (details of which are listed in the changelog below).

Changelog - ver XVII
(Version XVI: linuxppc-dev ref: 20100330095809.ga14...@in.ibm.com)
- CONFIG_HAVE_HW_BREAKPOINT is now used to define the scope of the new code (in lieu of CONFIG_PPC_BOOK3S_64).
- CONFIG_HAVE_HW_BREAKPOINT is now dependent upon CONFIG_PERF_EVENTS and CONFIG_PPC_BOOK3S_64 (to overcome build failures under certain configs).
- Included a target in arch/powerpc/lib/Makefile to build sstep.o when HAVE_HW_BREAKPOINT.
- Added a dummy definition for hw_breakpoint_disable() when !HAVE_HW_BREAKPOINT.
- Tested builds under defconfigs for ppc64, cell and g5. Found no patch-induced failures.

Kindly accept them to be a part of the -next tree.

Thanks,
K.Prasad

Changelog - ver XVI
(Version XV: linuxppc-dev ref: 20100323140639.ga21...@in.ibm.com)
- Used a new config option CONFIG_PPC_BOOK3S_64 (in lieu of CONFIG_PPC64/CPU_FTR_HAS_DABR) to limit the scope of the new code.
- Disabled breakpoints before kexec of the machine using hw_breakpoint_disable().
- Minor optimisation in exceptions-64s.S to check for data breakpoint exceptions in DSISR last (after checking for other causes) + changes in code comments and representation of the DSISR_DABRMATCH constant.
- Rebased to commit ae6be51ed01d6c4aaf249a207b4434bc7785853b of linux-2.6.

Changelog - ver XV
(Version XIV: linuxppc-dev ref: 20100308181232.ga3...@in.ibm.com)
- Additional patch to disable interrupts during data breakpoint exception handling.
- Moved HBP_NUM definition to cputable.h under a new CPU_FTR definition (CPU_FTR_HAS_DABR).
- Filtering of extraneous exceptions (due to accesses outside symbol length) is bypassed for ptrace requests.
- Removed flush_ptrace_hw_breakpoint() from __switch_to() due to incorrect coding placement.
- Changes to code comments as per community reviews for the previous version.
- Minor coding-style changes in hw_breakpoint.c as per review comments.
- Re-based to commit ae6be51ed01d6c4aaf249a207b4434bc7785853b of linux-2.6

Changelog - ver XIV (Version XIII: linuxppc-dev ref: 20100215055605.gb3...@in.ibm.com)
- Removed the 'name' field from 'struct arch_hw_breakpoint'.
- All callback invocations through bp->overflow_handler() are replaced with perf_bp_event().
- Removed the check for pre-existing single-stepping mode in hw_breakpoint_handler() as this check is unreliable while in kernel-space. A side effect of this change is the non-triggering of hw-breakpoints while single-stepping the kernel through KGDB or Xmon.
- Minor code-cleanups and addition of comments in hw_breakpoint_handler() and single_step_dabr_instruction().
- Re-based to commit 25cf84cf377c0aae5dbcf937ea89bc7893db5176 of linux-2.6

Changelog - ver XIII (Version XII: linuxppc-dev ref: 20100121084640.ga3...@in.ibm.com)
- Fixed a bug for user-space breakpoints (triggered following the failure of a breakpoint request).
- Re-based on commit 724e6d3fe8003c3f60bf404bf22e4e331327c596 of linux-2.6

Changelog - ver XII (Version XI: linuxppc-dev ref: 20100119091234.ga9...@in.ibm.com)
- Unset MSR_SE only if the kernel was not previously in single-step mode.
- Pre-emption is now enabled before returning from the hw-breakpoint exception handler.
- Variables to track the source of the single-step exception (breakpoint from kernel or user-space vs single-stepping due to other requests) are added.
- Filtering of extraneous hw-breakpoint exceptions (due to memory accesses lying outside the monitored symbol length) is now done for both kernel and user-space (previously only user-space).
- single_step_dabr_instruction() now returns NOTIFY_DONE if the kernel was in single-step mode even before the hw-breakpoint. This enables other users of single-step mode to be notified of the exception.
- User-space instructions are not emulated from kernel-space; they are instead single-stepped.
Changelog - ver XI (Version X: linuxppc-dev ref: 20091211160144.ga23...@in.ibm.com)
- Conditionally unset MSR_SE in the single-step handler
- Added comments to explain the duration and need for pre-emption disable following hw-breakpoint exception.

Changelog - ver X
- Re-write the PPC64 patches for the new implementation of hw-breakpoints that uses the perf-layer.
- Rebased to commit 7622fc234190a37d4e9fe3ed944a2b61a63fca03 of -tip.

Changelog - ver IX
- Invocation of user-defined callback will be 'trigger-after-execute' (except for ptrace).
- Creation of a new global per-CPU breakpoint structure to help invocation of user-defined callback from single-step handler. (Changes made now)
- Validation before registration will fail only if the address does not match the kernel symbol's (if s
Re: Possible bug with mutex adaptive spinning
On Wed, 2010-04-14 at 12:35 +1000, Benjamin Herrenschmidt wrote: > Hi Peter ! > > I -may- have found a bug with mutex adaptive spinning. We hit it when > torture testing CPU unplug. .../... In fact, I wonder if there's another potential problem: If the owner is actually running, it may do so for a very long time. It looks to me that everybody trying to take the mutex will thus spin and never get out of the spin loop until the owner stops running. That sounds very wrong to me :-) Shouldn't you at least remove the test for !owner when testing need_resched() in __mutex_lock_common() ? Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Possible bug with mutex adaptive spinning
Hi Peter ! I -may- have found a bug with mutex adaptive spinning. We hit it when torture testing CPU unplug. Basically, what happens is mutex_spin_on_owner() returns 1 if the owner CPU is offline. That means that the caller (__mutex_lock_common()) will spin until the mutex is released since there's a valid owner (so the need_resched() test doesn't trigger). This will deadlock if this is the only remaining CPU online. In fact, it even deadlocks with more than 1 CPU online; for example, I have a case where the 2 remaining online CPUs got into __mutex_lock_common() at the same time, and so got into that spin loop before either of them added itself to the wait_list, thus never hitting the exit condition there. I believe your test against nr_cpumask_bits is also a bit fragile for similar reasons. IE. You have what looks like a valid owner but it seems to be on an invalid CPU. It could be a freed thread_info, in which case returning 1 is fine, but if it's anything else, kaboom. I think it's better to be safe than sorry here and just go to sleep (ie return 0). Same comment with the DEBUG_PAGE_ALLOC case. In fact, this (untested) patch makes the function simpler and I don't think it will have any negative effect on performance. Let me know what you think: --- cut here --- [PATCH] mutex: Don't spin when the owner CPU is offline or other weird cases The current code might spin forever if the CPU owning the mutex has been offlined, and the last CPU in the system is trying to acquire it, since mutex_spin_on_owner() will always return 1, telling the caller to spin until the mutex has been released. This patch changes mutex_spin_on_owner() to return 0 (don't spin) in any case where we aren't sure about the validity of the owner struct or CPU number, or if the said CPU is offline. There is no point going back to re-evaluate spinning in corner cases like that; let's just go to sleep.
Signed-off-by: Benjamin Herrenschmidt --- diff --git a/kernel/sched.c b/kernel/sched.c index a3dff1f..11b7f4a 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3780,7 +3780,7 @@ int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner) * the mutex owner just released it and exited. */ if (probe_kernel_address(&owner->cpu, cpu)) - goto out; + return 0; #else cpu = owner->cpu; #endif @@ -3788,16 +3788,21 @@ int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner) /* * Even if the access succeeded (likely case), * the cpu field may no longer be valid. +* +* We stop spinning in this case since the owner may never */ if (cpu >= nr_cpumask_bits) - goto out; + return 0; /* * We need to validate that we can do a * get_cpu() and that we have the percpu area. +* +* If the CPU is offline, the owner will never release it so +* we must not spin */ if (!cpu_online(cpu)) - goto out; + return 0; rq = cpu_rq(cpu); @@ -3816,7 +3821,6 @@ int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner) cpu_relax(); } -out: return 1; } #endif ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 5/5] sched: make fix_small_imbalance work with asymmetric packing
On Tue, 2010-04-13 at 05:29 -0700, Peter Zijlstra wrote: > On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote: > > With the asymmetric packing infrastructure, fix_small_imbalance is > > causing idle higher threads to pull tasks off lower threads. > > > > This is being caused by an off-by-one error. > > > > Signed-off-by: Michael Neuling > > --- > > I'm not sure this is the right fix but without it, higher threads pull > > tasks off the lower threads, then the packing pulls it back down, etc > > etc and tasks bounce around constantly. > > Would help if you expand upon the why/how it manages to get pulled up. > > I can't immediately spot anything wrong with the patch, but then that > isn't my favourite piece of code either.. Suresh, any comments? > Sorry didn't pay much attention to this patchset. But based on the comments from Michael and looking at this patchset, it has SMT/MC implications. I will review and run some tests and get back in a day. As far as this particular patch is concerned, original code is coming from Ingo's original CFS code commit (dd41f596) and the below hunk pretty much explains what the change is about. - if (max_load - this_load >= busiest_load_per_task * imbn) { + if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >= + busiest_load_per_task * imbn) { So the below proposed change will probably break what the above mentioned commit was trying to achieve, which is: for fairness reasons we were bouncing the small extra load (between the max_load and this_load) around. 
> > --- > > > > kernel/sched_fair.c |2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > Index: linux-2.6-ozlabs/kernel/sched_fair.c > > === > > --- linux-2.6-ozlabs.orig/kernel/sched_fair.c > > +++ linux-2.6-ozlabs/kernel/sched_fair.c > > @@ -2652,7 +2652,7 @@ static inline void fix_small_imbalance(s > > * SCHED_LOAD_SCALE; > > scaled_busy_load_per_task /= sds->busiest->cpu_power; > > > > - if (sds->max_load - sds->this_load + scaled_busy_load_per_task >= > > + if (sds->max_load - sds->this_load + scaled_busy_load_per_task > > > (scaled_busy_load_per_task * imbn)) { > > *imbalance = sds->busiest_load_per_task; > > return; > thanks, suresh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2] powerpc: Track backing pages allocated by vmemmap_populate()
On Tuesday 13 April 2010 21:16:44 Benjamin Herrenschmidt wrote: > On Tue, 2010-04-13 at 16:02 +1000, Mark Nelson wrote: > > That's a good question, and one that I probably should have added to > > the > > commit message. > > But, following through, it looks like we end up calling into > > __remove_section() from mm/memory_hotplug.c and if > > CONFIG_SPARSEMEM_VMEMMAP is enabled we just return EBUSY as freeing > > memmap with vmemmap isn't implemented yet. > > > > So for the moment, I'm not sure we have to worry about it. > > We probably don't. IE. The vmemmap will remain for those struct pages, > which means they won't need to be allocated again if some memory is > plugged back there. If not, we just waste a bit of memory. Not a big > deal. > Excellent! That makes sense. Thanks Ben! Mark. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH v3 3/6] RapidIO: Add Port-Write handling for EM
Andrew Morton wrote: > On Tue, 6 Apr 2010 17:22:38 -0400 Alexandre Bounine wrote: > > > > > From: Alexandre Bounine > > > > Add RapidIO Port-Write message handling in the context > > of Error Management Extensions Specification Rev.1.3. > > > > ... > > > > +static struct rio_dev *rio_get_comptag(u32 comp_tag, struct rio_dev *from) > > +{ > > + struct list_head *n; > > + struct rio_dev *rdev; > > + > > + WARN_ON(in_interrupt()); > > The check should be unneeded - lockdep will warn about this. Thank you. I will remove it in my next set of patches. > > > + spin_lock(&rio_global_list_lock); > > + n = from ? from->global_list.next : rio_devices.next; > > + > > + while (n && (n != &rio_devices)) { > > + rdev = rio_dev_g(n); > > + if (rdev->comp_tag == comp_tag) > > + goto exit; > > + n = n->next; > > + } > > + rdev = NULL; > > +exit: > > + spin_unlock(&rio_global_list_lock); > > + return rdev; > > +} > > + ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH 2/2] [V5] Add non-Virtex5 support for LL TEMAC driver
> -Original Message- > From: David Miller [mailto:da...@davemloft.net] > Sent: Tuesday, April 13, 2010 2:34 AM > To: grant.lik...@secretlab.ca > Cc: John Linn; net...@vger.kernel.org; linuxppc-...@ozlabs.org; jwbo...@linux.vnet.ibm.com; > eric.duma...@gmail.com; john.willi...@petalogix.com; michal.si...@petalogix.com; jty...@cs.ucr.edu > Subject: Re: [PATCH 2/2] [V5] Add non-Virtex5 support for LL TEMAC driver > > From: Grant Likely > Date: Fri, 9 Apr 2010 12:10:21 -0 > > > On Thu, Apr 8, 2010 at 11:08 AM, John Linn wrote: > >> This patch adds support for using the LL TEMAC Ethernet driver on > >> non-Virtex 5 platforms by adding support for accessing the Soft DMA > >> registers as if they were memory mapped instead of solely through the > >> DCR's (available on the Virtex 5). > >> > >> The patch also updates the driver so that it runs on the MicroBlaze. > >> The changes were tested on the PowerPC 440, PowerPC 405, and the > >> MicroBlaze platforms. > >> > >> Signed-off-by: John Tyner > >> Signed-off-by: John Linn > > > > Picked up and build tested both patches on 405, 440, 60x and ppc64. > > No build problems found either built-in or as a module. > > > > for both: > > Acked-by: Grant Likely > > Ok, both applied to net-next-2.6, thanks everyone for sorting this > out. Great! Thanks David, appreciate the help. This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: Upgrade to 2.6.26 results in Oops during request_irq
>From: Sparks, Sam >Sent: Thursday, April 08, 2010 4:15 PM > >Howdy All, > >I have (almost) successfully upgraded from Linux 2.6.22 to 2.6.26 (both >downloaded from debian) on my mpc8347 powerpc, but I think I may be >missing a required change to my dts regarding the IPIC or maybe a change >in how my driver requests IRQs. > >I have several modules that are maintained outside the kernel tree, and >all but one is working correctly. The offending driver is attempting to >register IRQ 23, and is accessing an invalid memory location. I am >currently getting the following stack dump: Aha! I have found that change that caused my module to break: http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg09748.html This patch modified ipic_set_irq_type to override the handle_irq function pointer. Do I need to register a new function to handle falling edge triggered external interrupts? It appears the default is NULL. --Sam ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/5] sched: add asymmetric packing option for sibling domain
On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote: > Peter: Since this is based mainly off your initial patch, it should > have your signed-off-by too, but I didn't want to add without your > permission. Can I add it? Of course! :-) This thing does need a better changelog though, and maybe a larger comment with check_asym_packing(), explaining why and what we're doing and what we're assuming (that lower cpu number also means lower thread number). ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/5] sched: Mark the balance type for use in need_active_balance()
On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote: > need_active_balance() gates the asymmetric packing based due to power > save logic, but for packing we don't care. This explanation lacks a how/why. So the problem is that need_active_balance() ends up returning false and prevents the active balance from pulling a task to a lower available SMT sibling? > This marks the type of balanace we are attempting to do perform from > f_b_g() and stops need_active_balance() power save logic gating a > balance in the asymmetric packing case. At the very least this wants more comments in the code. I'm not really charmed by having to add yet another variable to pass around that mess, but I can't seem to come up with something cleaner either. > Signed-off-by: Michael Neuling > > --- > > kernel/sched_fair.c | 24 +++- > 1 file changed, 19 insertions(+), 5 deletions(-) > > Index: linux-2.6-ozlabs/kernel/sched_fair.c > === > --- linux-2.6-ozlabs.orig/kernel/sched_fair.c > +++ linux-2.6-ozlabs/kernel/sched_fair.c > @@ -91,6 +91,13 @@ const_debug unsigned int sysctl_sched_mi > > static const struct sched_class fair_sched_class; > > +enum balance_type { > + BALANCE_NONE = 0, > + BALANCE_LOAD, > + BALANCE_POWER, > + BALANCE_PACKING > +}; > + > /** > * CFS operations on generic schedulable entities: > */ > @@ -2783,7 +2790,8 @@ static inline void calculate_imbalance(s > static struct sched_group * > find_busiest_group(struct sched_domain *sd, int this_cpu, > unsigned long *imbalance, enum cpu_idle_type idle, > -int *sd_idle, const struct cpumask *cpus, int *balance) > +int *sd_idle, const struct cpumask *cpus, int *balance, > +enum balance_type *bt) > { > struct sd_lb_stats sds; > > @@ -2808,6 +2816,7 @@ find_busiest_group(struct sched_domain * > if (!(*balance)) > goto ret; > > + *bt = BALANCE_PACKING; > if ((idle == CPU_IDLE || idle == CPU_NEWLY_IDLE) && > check_asym_packing(sd, &sds, this_cpu, imbalance)) > return sds.busiest; > @@ -2828,6 +2837,7 @@ find_busiest_group(struct 
sched_domain * > > /* Looks like there is an imbalance. Compute it */ > calculate_imbalance(&sds, this_cpu, imbalance); > + *bt = BALANCE_LOAD; > return sds.busiest; > > out_balanced: > @@ -2835,10 +2845,12 @@ out_balanced: >* There is no obvious imbalance. But check if we can do some balancing >* to save power. >*/ > + *bt = BALANCE_POWER; > if (check_power_save_busiest_group(&sds, this_cpu, imbalance)) > return sds.busiest; > ret: > *imbalance = 0; > + *bt = BALANCE_NONE; > return NULL; > } > > @@ -2899,9 +2911,10 @@ find_busiest_queue(struct sched_group *g > /* Working cpumask for load_balance and load_balance_newidle. */ > static DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask); > > -static int need_active_balance(struct sched_domain *sd, int sd_idle, int > idle) > +static int need_active_balance(struct sched_domain *sd, int sd_idle, int > idle, > +enum balance_type *bt) > { > - if (idle == CPU_NEWLY_IDLE) { > + if (idle == CPU_NEWLY_IDLE && *bt != BALANCE_PACKING) { > /* >* The only task running in a non-idle cpu can be moved to this >* cpu in an attempt to completely freeup the other CPU > @@ -2946,6 +2959,7 @@ static int load_balance(int this_cpu, st > struct rq *busiest; > unsigned long flags; > struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask); > + enum balance_type bt; > > cpumask_copy(cpus, cpu_active_mask); > > @@ -2964,7 +2978,7 @@ static int load_balance(int this_cpu, st > redo: > update_shares(sd); > group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle, > -cpus, balance); > +cpus, balance, &bt); > > if (*balance == 0) > goto out_balanced; > @@ -3018,7 +3032,7 @@ redo: > schedstat_inc(sd, lb_failed[idle]); > sd->nr_balance_failed++; > > - if (need_active_balance(sd, sd_idle, idle)) { > + if (need_active_balance(sd, sd_idle, idle, &bt)) { > raw_spin_lock_irqsave(&busiest->lock, flags); > > /* don't kick the migration_thread, if the curr ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org 
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/5] sched: fix capacity calculations for SMT4
On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote: > When calculating capacity we use the following calculation: > >capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE); > > In SMT2, power will be 1178/2 (provided we are not scaling power with > freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity > being 1. > > With SMT4 however, power will be 1178/4, hence capacity will end up as > 0. > > Fix this by ensuring capacity is always at least 1 after this > calculation. > > Signed-off-by: Michael Neuling > --- > I'm not sure this is the correct fix but this works for me. Right, so I suspect this will indeed break some things. We initially allowed 0 capacity for when a cpu is consumed by an RT task and there simply isn't much capacity left, in that case you really want to try and move load to your sibling cpus if possible. However you're right that this goes awry in your case. One thing to look at is if that 15% increase is indeed representative of the power7 cpu; it having 4 SMT threads seems to suggest there were significant gains, otherwise they'd not have wasted the silicon. (The broken x86 code was meant to actually compute the SMT gain, so that we'd not have to guess the 15%) Now, increasing this will only marginally fix the issue, since if you end up with 512 per thread it only takes a very tiny amount of RT workload to drop below and end up at 0 again. One thing we could look at is using the cpu base power to compute capacity from. We'd have to add another field to sched_group and store power before we do the scale_rt_power() stuff. Thoughts?
> kernel/sched_fair.c |6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > Index: linux-2.6-ozlabs/kernel/sched_fair.c > === > --- linux-2.6-ozlabs.orig/kernel/sched_fair.c > +++ linux-2.6-ozlabs/kernel/sched_fair.c > @@ -1482,6 +1482,7 @@ static int select_task_rq_fair(struct ta > } > > capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE); > + capacity = max(capacity, 1UL); > > if (tmp->flags & SD_POWERSAVINGS_BALANCE) > nr_running /= 2; > @@ -2488,6 +2489,7 @@ static inline void update_sg_lb_stats(st > > sgs->group_capacity = > DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE); > + sgs->group_capacity = max(sgs->group_capacity, 1UL); > } > > /** > @@ -2795,9 +2797,11 @@ find_busiest_queue(struct sched_group *g > > for_each_cpu(i, sched_group_cpus(group)) { > unsigned long power = power_of(i); > - unsigned long capacity = DIV_ROUND_CLOSEST(power, > SCHED_LOAD_SCALE); > + unsigned long capacity; > unsigned long wl; > > + capacity = max(DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE), 1UL); > + > if (!cpumask_test_cpu(i, cpus)) > continue; > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 5/5] sched: make fix_small_imbalance work with asymmetric packing
On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote: > With the asymmetric packing infrastructure, fix_small_imbalance is > causing idle higher threads to pull tasks off lower threads. > > This is being caused by an off-by-one error. > > Signed-off-by: Michael Neuling > --- > I'm not sure this is the right fix but without it, higher threads pull > tasks off the lower threads, then the packing pulls it back down, etc > etc and tasks bounce around constantly. Would help if you expand upon the why/how it manages to get pulled up. I can't immediately spot anything wrong with the patch, but then that isn't my favourite piece of code either.. Suresh, any comments? > --- > > kernel/sched_fair.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > Index: linux-2.6-ozlabs/kernel/sched_fair.c > === > --- linux-2.6-ozlabs.orig/kernel/sched_fair.c > +++ linux-2.6-ozlabs/kernel/sched_fair.c > @@ -2652,7 +2652,7 @@ static inline void fix_small_imbalance(s >* SCHED_LOAD_SCALE; > scaled_busy_load_per_task /= sds->busiest->cpu_power; > > - if (sds->max_load - sds->this_load + scaled_busy_load_per_task >= > + if (sds->max_load - sds->this_load + scaled_busy_load_per_task > > (scaled_busy_load_per_task * imbn)) { > *imbalance = sds->busiest_load_per_task; > return; ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2] powerpc: Track backing pages allocated by vmemmap_populate()
On Tue, 2010-04-13 at 16:02 +1000, Mark Nelson wrote: > That's a good question, and one that I probably should have added to > the > commit message. > But, following through, it looks like we end up calling into > __remove_section() from mm/memory_hotplug.c and if > CONFIG_SPARSEMEM_VMEMMAP is enabled we just return EBUSY as freeing > memmap with vmemmap isn't implemented yet. > > So for the moment, I'm not sure we have to worry about it. We probably don't. IE. The vmemmap will remain for those struct pages, which means they won't need to be allocated again if some memory is plugged back there. If not, we just waste a bit of memory. Not a big deal. Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 net-next 2/2] gianfar: Add hardware TX timestamping support
From: Manfred Rudigier Date: Fri, 9 Apr 2010 11:10:35 +0200 > If a packet has the skb_shared_tx->hardware flag set the device is > instructed to generate a TX timestamp and write it back to memory after > the frame is transmitted. During the clean_tx_ring operation the > timestamp will be extracted and copied into the skb_shared_hwtstamps > struct of the skb. > > TX timestamping is enabled by setting the tx_type to something else > than HWTSTAMP_TX_OFF with the SIOCSHWTSTAMP ioctl command. It is only > supported by eTSEC devices. > > Signed-off-by: Manfred Rudigier Applied to net-next-2.6 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 net-next 1/2] gianfar: Add hardware RX timestamping support
From: Manfred Rudigier Date: Fri, 9 Apr 2010 11:10:03 +0200 > The device is configured to insert hardware timestamps into all > received packets. The RX timestamps are extracted from the padding > alignment bytes during the clean_rx_ring operation and copied into the > skb_shared_hwtstamps struct of the skb. This extraction only happens if > the rx_filter was set to something else than HWTSTAMP_FILTER_NONE with > the SIOCSHWTSTAMP ioctl command. > > Hardware timestamping is only supported for eTSEC devices. To indicate > device support the new FSL_GIANFAR_DEV_HAS_TIMER flag was introduced. > > Signed-off-by: Manfred Rudigier Applied to net-next-2.6 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] perf: Move arch specific code into separate arch directory
Hi Masami, Thanks for the feedback. Excerpts from Masami Hiramatsu's message of Mon Apr 12 22:37:03 +1000 2010: > Could you add a check whether the get_arch_regstr() is defined > (or dwarf-regs.h exists) in Makefile? > If it is not defined, we'd better drop dwarf support (so set NO_DWARF), > because it means we haven't ported perf probe on that architecture yet. :) I was a little reluctant to do that when I first read your message because it felt like adding autoconf stuff into the Makefile and the code should already fail gracefully on architectures without the register mappings. But, since the Makefile already has some autoconf-like magic and it would make the code cleaner, I'll play with the idea a bit more tomorrow. > > diff --git a/tools/perf/arch/x86/include/arch_dwarf-regs.h > > b/tools/perf/arch/x86/include/arch_dwarf-regs.h > > new file mode 100644 > > index 000..9e8da6a > > --- /dev/null > > +++ b/tools/perf/arch/x86/include/arch_dwarf-regs.h > > @@ -0,0 +1,6 @@ > > +#ifndef _PREF_ARCH_PPC_DWARF_REGS_H > > +#define _PREF_ARCH_PPC_DWARF_REGS_H > > _PREF_ARCH_X86_DWARF_REGS_H ? Nice catch (oops) Cheers, -Ian ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/2] [V5] Add non-Virtex5 support for LL TEMAC driver
From: Grant Likely Date: Fri, 9 Apr 2010 12:10:21 -0600 > On Thu, Apr 8, 2010 at 11:08 AM, John Linn wrote: >> This patch adds support for using the LL TEMAC Ethernet driver on >> non-Virtex 5 platforms by adding support for accessing the Soft DMA >> registers as if they were memory mapped instead of solely through the >> DCR's (available on the Virtex 5). >> >> The patch also updates the driver so that it runs on the MicroBlaze. >> The changes were tested on the PowerPC 440, PowerPC 405, and the >> MicroBlaze platforms. >> >> Signed-off-by: John Tyner >> Signed-off-by: John Linn > > Picked up and build tested both patches on 405, 440, 60x and ppc64. > No build problems found either built-in or as a module. > > for both: > Acked-by: Grant Likely Ok, both applied to net-next-2.6, thanks everyone for sorting this out. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev