Re: [PATCH] powerpc/perf_event: Fix oops due to perf_event_do_pending call

2010-04-13 Thread Benjamin Herrenschmidt
On Wed, 2010-04-14 at 16:46 +1000, Paul Mackerras wrote:
> Ben, please put this in your tree of fixes to go to Linus for 2.6.34,
> since it fixes a potential panic.

Should it go to -stable as well ? How far back ?

Cheers,
Ben.



[PATCH] powerpc/perf_event: Fix oops due to perf_event_do_pending call

2010-04-13 Thread Paul Mackerras
Anton Blanchard found that large POWER systems would occasionally
crash in the exception exit path when profiling with perf_events.
The symptom was that an interrupt would occur late in the exit path
when the MSR[RI] (recoverable interrupt) bit was clear.  Interrupts
should be hard-disabled at this point but they were enabled.  Because
the interrupt was not recoverable the system panicked.

The reason is that the exception exit path was calling
perf_event_do_pending after hard-disabling interrupts, and
perf_event_do_pending will re-enable interrupts.

The simplest and cleanest fix for this is to use the same mechanism
that 32-bit powerpc does, namely to cause a self-IPI by setting the
decrementer to 1.  This means we can remove the tests in the exception
exit path and raw_local_irq_restore.

This also makes sure that the call to perf_event_do_pending from
timer_interrupt() happens within irq_enter/irq_exit.  (Note that
calling perf_event_do_pending from timer_interrupt does not mean that
there is a possible 1/HZ latency; setting the decrementer to 1 ensures
that the timer interrupt will happen immediately, i.e. within one
timebase tick, which is a few nanoseconds or 10s of nanoseconds.)
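
For reference, the mechanism described above boils down to something like
the following sketch (illustrative only -- the real hunks live in the
truncated arch/powerpc/kernel/time.c part of the diff below; set_dec() is
the existing powerpc helper that writes the decrementer):

DEFINE_PER_CPU(u8, perf_event_pending);

void set_perf_event_pending(void)
{
	get_cpu_var(perf_event_pending) = 1;
	set_dec(1);	/* force a decrementer exception almost immediately */
	put_cpu_var(perf_event_pending);
}

/* ...and in timer_interrupt(), between irq_enter() and irq_exit(): */
	if (test_perf_event_pending()) {
		clear_perf_event_pending();
		perf_event_do_pending();
	}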

Signed-off-by: Paul Mackerras 
Cc: sta...@kernel.org
---
Ben, please put this in your tree of fixes to go to Linus for 2.6.34,
since it fixes a potential panic.

 arch/powerpc/include/asm/hw_irq.h |   38 
 arch/powerpc/kernel/asm-offsets.c |1 
 arch/powerpc/kernel/entry_64.S|9 -
 arch/powerpc/kernel/irq.c |6 ---
 arch/powerpc/kernel/time.c|   60 ++
 5 files changed, 48 insertions(+), 66 deletions(-)
diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index 9f4c9d4..bd100fc 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -130,43 +130,5 @@ static inline int irqs_disabled_flags(unsigned long flags)
  */
 struct irq_chip;
 
-#ifdef CONFIG_PERF_EVENTS
-
-#ifdef CONFIG_PPC64
-static inline unsigned long test_perf_event_pending(void)
-{
-   unsigned long x;
-
-   asm volatile("lbz %0,%1(13)"
-   : "=r" (x)
-   : "i" (offsetof(struct paca_struct, perf_event_pending)));
-   return x;
-}
-
-static inline void set_perf_event_pending(void)
-{
-   asm volatile("stb %0,%1(13)" : :
-   "r" (1),
-   "i" (offsetof(struct paca_struct, perf_event_pending)));
-}
-
-static inline void clear_perf_event_pending(void)
-{
-   asm volatile("stb %0,%1(13)" : :
-   "r" (0),
-   "i" (offsetof(struct paca_struct, perf_event_pending)));
-}
-#endif /* CONFIG_PPC64 */
-
-#else  /* CONFIG_PERF_EVENTS */
-
-static inline unsigned long test_perf_event_pending(void)
-{
-   return 0;
-}
-
-static inline void clear_perf_event_pending(void) {}
-#endif /* CONFIG_PERF_EVENTS */
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_HW_IRQ_H */
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 957ceb7..c09138d 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -133,7 +133,6 @@ int main(void)
DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr));
DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled));
DEFINE(PACAHARDIRQEN, offsetof(struct paca_struct, hard_enabled));
-   DEFINE(PACAPERFPEND, offsetof(struct paca_struct, perf_event_pending));
DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
 #ifdef CONFIG_PPC_MM_SLICES
DEFINE(PACALOWSLICESPSIZE, offsetof(struct paca_struct,
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 07109d8..42e9d90 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -556,15 +556,6 @@ ALT_FW_FTR_SECTION_END_IFCLR(FW_FEATURE_ISERIES)
 2:
TRACE_AND_RESTORE_IRQ(r5);
 
-#ifdef CONFIG_PERF_EVENTS
-   /* check paca->perf_event_pending if we're enabling ints */
-   lbz r3,PACAPERFPEND(r13)
-   and.r3,r3,r5
-   beq 27f
-   bl  .perf_event_do_pending
-27:
-#endif /* CONFIG_PERF_EVENTS */
-
/* extract EE bit and use it to restore paca->hard_enabled */
ld  r3,_MSR(r1)
rldicl  r4,r3,49,63 /* r0 = (r3 >> 15) & 1 */
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 64f6f20..066bd31 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -53,7 +53,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -145,11 +144,6 @@ notrace void raw_local_irq_restore(unsigned long en)
}
 #endif /* CONFIG_PPC_STD_MMU_64 */
 
-   if (test_perf_event_pending()) {
-   clear_perf_event_pending();
-   perf_event_do_pending();
-   }
-
/*
 * if (get_paca()->hard_enabled) return;
 * But again we need t

Re: [PATCH 2/5] sched: add asymmetric packing option for sibling domain

2010-04-13 Thread Michael Neuling
In message <1271161767.4807.1281.ca...@twins> you wrote:
> On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > Peter: Since this is based mainly off your initial patch, it should
> > have your signed-off-by too, but I didn't want to add without your
> > permission.  Can I add it?
> 
> Of course! :-)
> 
> This thing does need a better changelog though, and maybe a larger
> comment with check_asym_packing(), explaining why and what we're doing
> and what we're assuming (that lower cpu number also means lower thread
> number).
> 

OK, updated patch below...

Mikey


[PATCH 2/5] sched: add asymmetric group packing option for sibling domain

Check to see if the group is packed in a sched domain.

This is primarily intended to be used at the sibling level.  Some cores
like POWER7 prefer to use lower numbered SMT threads.  In the case of
POWER7, it can move to lower SMT modes only when higher threads are
idle.  When in lower SMT modes, the threads will perform better since
they share less core resources.  Hence when we have idle threads, we
want them to be the higher ones.

This adds a hook into f_b_g() called check_asym_packing() to check the
packing.  This packing function is run on idle threads.  It checks to
see if the busiest CPU in this domain (core in the P7 case) has a
higher CPU number than the CPU the packing function is being run on.
If it does, we calculate the imbalance and return the busier
(higher-numbered) group as the busiest group to f_b_g().  Here we are
assuming a lower
CPU number will be equivalent to a lower SMT thread number.

It also creates a new SD_ASYM_PACKING flag to enable this feature at
any scheduler domain level.

It also creates an arch hook to enable this feature at the sibling
level.  The default function doesn't enable this feature.
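
The diff below is truncated before check_asym_packing() itself appears, so
for reference here is a rough sketch of the check described above (helper
names such as group_first_cpu() are assumptions; the real code may differ):

static int check_asym_packing(struct sched_domain *sd, struct sd_lb_stats *sds,
			      int this_cpu, unsigned long *imbalance)
{
	int busiest_cpu;

	if (!(sd->flags & SD_ASYM_PACKING))
		return 0;

	if (!sds->busiest)
		return 0;

	/* assumes a lower CPU number means a lower SMT thread number */
	busiest_cpu = group_first_cpu(sds->busiest);
	if (this_cpu > busiest_cpu)
		return 0;

	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
				       SCHED_LOAD_SCALE);
	return 1;
}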

Based heavily on a patch from Peter Zijlstra.

Signed-off-by: Michael Neuling 
Signed-off-by: Peter Zijlstra 
---
 include/linux/sched.h|4 +-
 include/linux/topology.h |1 
 kernel/sched_fair.c  |   93 +--
 3 files changed, 94 insertions(+), 4 deletions(-)

Index: linux-2.6-ozlabs/include/linux/sched.h
===
--- linux-2.6-ozlabs.orig/include/linux/sched.h
+++ linux-2.6-ozlabs/include/linux/sched.h
@@ -799,7 +799,7 @@ enum cpu_idle_type {
 #define SD_POWERSAVINGS_BALANCE0x0100  /* Balance for power savings */
 #define SD_SHARE_PKG_RESOURCES 0x0200  /* Domain members share cpu pkg 
resources */
 #define SD_SERIALIZE   0x0400  /* Only a single load balancing 
instance */
-
+#define SD_ASYM_PACKING0x0800  /* Place busy groups earlier in 
the domain */
 #define SD_PREFER_SIBLING  0x1000  /* Prefer to place tasks in a sibling 
domain */
 
 enum powersavings_balance_level {
@@ -834,6 +834,8 @@ static inline int sd_balance_for_package
return SD_PREFER_SIBLING;
 }
 
+extern int __weak arch_sd_sibiling_asym_packing(void);
+
 /*
  * Optimise SD flags for power savings:
  * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings.
Index: linux-2.6-ozlabs/include/linux/topology.h
===
--- linux-2.6-ozlabs.orig/include/linux/topology.h
+++ linux-2.6-ozlabs/include/linux/topology.h
@@ -102,6 +102,7 @@ int arch_update_cpu_topology(void);
| 1*SD_SHARE_PKG_RESOURCES  \
| 0*SD_SERIALIZE\
| 0*SD_PREFER_SIBLING   \
+   | arch_sd_sibiling_asym_packing()   \
,   \
.last_balance   = jiffies,  \
.balance_interval   = 1,\
Index: linux-2.6-ozlabs/kernel/sched_fair.c
===
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2493,6 +2493,39 @@ static inline void update_sg_lb_stats(st
 }
 
 /**
+ * update_sd_pick_busiest - return 1 on busiest group
+ * @sd: sched_domain whose statistics are to be checked
+ * @sds: sched_domain statistics
+ * @sg: sched_group candidate to be checked for being the busiest
+ * @sds: sched_group statistics
+ *
+ * This returns 1 for the busiest group. If asymmetric packing is
+ * enabled and we already have a busiest, but this candidate group has
+ * a higher cpu number than the current busiest, pick this sg.
+ */
+static int update_sd_pick_busiest(struct sched_domain *sd,
+ struct sd_lb_stats *sds,
+ struct sched_group *sg,
+ struct sg_lb_stats *sgs)
+{
+   if (sgs->sum_nr_running > sgs->group_capacity)
+   return 1;
+
+   if (sgs->group_imb)
+   return 1;

[PATCH v2 2/2] perf probe: Add PowerPC DWARF register number mappings

2010-04-13 Thread Ian Munsie
From: Ian Munsie 

This patch adds mappings from the register numbers from DWARF to the
register names used in the PowerPC Regs and Stack Access API. This
allows perf probe to be used to record variable contents on PowerPC.

This patch depends on functionality in the powerpc/next tree, though it
will compile fine without it. Specifically this patch depends on commit
"powerpc: Add kprobe-based event tracer"

Signed-off-by: Ian Munsie 
---
 Changes since v1: Removed arch specific arch_dwarf-regs.h and defined
 PERF_HAVE_DWARF_REGS in the arch specific Makefile.

 tools/perf/arch/powerpc/Makefile  |2 +
 tools/perf/arch/powerpc/util/dwarf-regs.c |   88 +
 2 files changed, 90 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/arch/powerpc/Makefile
 create mode 100644 tools/perf/arch/powerpc/util/dwarf-regs.c

diff --git a/tools/perf/arch/powerpc/Makefile b/tools/perf/arch/powerpc/Makefile
new file mode 100644
index 000..cbd7833
--- /dev/null
+++ b/tools/perf/arch/powerpc/Makefile
@@ -0,0 +1,2 @@
+PERF_HAVE_DWARF_REGS := 1
+LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/dwarf-regs.o
diff --git a/tools/perf/arch/powerpc/util/dwarf-regs.c 
b/tools/perf/arch/powerpc/util/dwarf-regs.c
new file mode 100644
index 000..48ae0c5
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/dwarf-regs.c
@@ -0,0 +1,88 @@
+/*
+ * Mapping of DWARF debug register numbers into register names.
+ *
+ * Copyright (C) 2010 Ian Munsie, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+
+
+struct pt_regs_dwarfnum {
+   const char *name;
+   unsigned int dwarfnum;
+};
+
+#define STR(s) #s
+#define REG_DWARFNUM_NAME(r, num) {.name = r, .dwarfnum = num}
+#define GPR_DWARFNUM_NAME(num) \
+   {.name = STR(%gpr##num), .dwarfnum = num}
+#define REG_DWARFNUM_END {.name = NULL, .dwarfnum = 0}
+
+/*
+ * Reference:
+ * http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html
+ */
+static const struct pt_regs_dwarfnum regdwarfnum_table[] = {
+   GPR_DWARFNUM_NAME(0),
+   GPR_DWARFNUM_NAME(1),
+   GPR_DWARFNUM_NAME(2),
+   GPR_DWARFNUM_NAME(3),
+   GPR_DWARFNUM_NAME(4),
+   GPR_DWARFNUM_NAME(5),
+   GPR_DWARFNUM_NAME(6),
+   GPR_DWARFNUM_NAME(7),
+   GPR_DWARFNUM_NAME(8),
+   GPR_DWARFNUM_NAME(9),
+   GPR_DWARFNUM_NAME(10),
+   GPR_DWARFNUM_NAME(11),
+   GPR_DWARFNUM_NAME(12),
+   GPR_DWARFNUM_NAME(13),
+   GPR_DWARFNUM_NAME(14),
+   GPR_DWARFNUM_NAME(15),
+   GPR_DWARFNUM_NAME(16),
+   GPR_DWARFNUM_NAME(17),
+   GPR_DWARFNUM_NAME(18),
+   GPR_DWARFNUM_NAME(19),
+   GPR_DWARFNUM_NAME(20),
+   GPR_DWARFNUM_NAME(21),
+   GPR_DWARFNUM_NAME(22),
+   GPR_DWARFNUM_NAME(23),
+   GPR_DWARFNUM_NAME(24),
+   GPR_DWARFNUM_NAME(25),
+   GPR_DWARFNUM_NAME(26),
+   GPR_DWARFNUM_NAME(27),
+   GPR_DWARFNUM_NAME(28),
+   GPR_DWARFNUM_NAME(29),
+   GPR_DWARFNUM_NAME(30),
+   GPR_DWARFNUM_NAME(31),
+   REG_DWARFNUM_NAME("%msr",   66),
+   REG_DWARFNUM_NAME("%ctr",   109),
+   REG_DWARFNUM_NAME("%link",  108),
+   REG_DWARFNUM_NAME("%xer",   101),
+   REG_DWARFNUM_NAME("%dar",   119),
+   REG_DWARFNUM_NAME("%dsisr", 118),
+   REG_DWARFNUM_END,
+};
+
+/**
+ * get_arch_regstr() - lookup register name from its DWARF register number
+ * @n: the DWARF register number
+ *
+ * get_arch_regstr() returns the name of the register in struct
+ * regdwarfnum_table from its DWARF register number. If the register is not
+ * found in the table, this returns NULL.
+ */
+const char *get_arch_regstr(unsigned int n)
+{
+   const struct pt_regs_dwarfnum *roff;
+   for (roff = regdwarfnum_table; roff->name != NULL; roff++)
+   if (roff->dwarfnum == n)
+   return roff->name;
+   return NULL;
+}
-- 
1.7.0



[PATCH v2 1/2] perf: Move arch specific code into separate arch directory

2010-04-13 Thread Ian Munsie
From: Ian Munsie 

The perf userspace tool included some architecture specific code to map
registers from the DWARF register number into the names used by the regs
and stack access API.

This patch moves the architecture specific code out into a separate
arch/x86 directory along with the infrastructure required to use it.

Signed-off-by: Ian Munsie 
---
Changes since v1: From Masami Hiramatsu's suggestion, I added a check in the
Makefile for if the arch specific Makefile defines PERF_HAVE_DWARF_REGS,
printing a message during build if it has not. This simplifies the code
removing the odd macro from the previous version and the need for an arch
specific arch_dwarf-regs.h. I have not entirely disabled DWARF support for
architectures that don't implement the register mappings, so that they can
still add a probe based on a line number (they will be missing the ability to
capture the value of a variable from a register).

 tools/perf/Makefile   |   27 ++-
 tools/perf/arch/x86/Makefile  |2 +
 tools/perf/arch/x86/util/dwarf-regs.c |   75 +
 tools/perf/util/include/dwarf-regs.h  |6 +++
 tools/perf/util/probe-finder.c|   56 +++--
 5 files changed, 113 insertions(+), 53 deletions(-)
 create mode 100644 tools/perf/arch/x86/Makefile
 create mode 100644 tools/perf/arch/x86/util/dwarf-regs.c
 create mode 100644 tools/perf/util/include/dwarf-regs.h

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 57b3569..7f71daf 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -173,6 +173,20 @@ uname_R := $(shell sh -c 'uname -r 2>/dev/null || echo 
not')
 uname_P := $(shell sh -c 'uname -p 2>/dev/null || echo not')
 uname_V := $(shell sh -c 'uname -v 2>/dev/null || echo not')
 
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/ -e s/sun4u/sparc64/ \
+ -e s/arm.*/arm/ -e s/sa110/arm/ \
+ -e s/s390x/s390/ -e s/parisc64/parisc/ \
+ -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
+ -e s/sh[234].*/sh/ )
+
+# Additional ARCH settings for x86
+ifeq ($(ARCH),i386)
+ARCH := x86
+endif
+ifeq ($(ARCH),x86_64)
+ARCH := x86
+endif
+
 # CFLAGS and LDFLAGS are for the users to override from the command line.
 
 #
@@ -285,7 +299,7 @@ endif
 # Those must not be GNU-specific; they are shared with perl/ which may
 # be built by a different compiler. (Note that this is an artifact now
 # but it still might be nice to keep that distinction.)
-BASIC_CFLAGS = -Iutil/include
+BASIC_CFLAGS = -Iutil/include -Iarch/$(ARCH)/include
 BASIC_LDFLAGS =
 
 # Guard against environment variables
@@ -367,6 +381,7 @@ LIB_H += util/include/asm/byteorder.h
 LIB_H += util/include/asm/swab.h
 LIB_H += util/include/asm/system.h
 LIB_H += util/include/asm/uaccess.h
+LIB_H += util/include/dwarf-regs.h
 LIB_H += perf.h
 LIB_H += util/cache.h
 LIB_H += util/callchain.h
@@ -485,6 +500,7 @@ PERFLIBS = $(LIB_FILE)
 
 -include config.mak.autogen
 -include config.mak
+-include arch/$(ARCH)/Makefile
 
 ifeq ($(uname_S),Darwin)
ifndef NO_FINK
@@ -525,8 +541,13 @@ ifndef NO_DWARF
BASIC_CFLAGS += -I/usr/include/elfutils -DDWARF_SUPPORT
EXTLIBS += -lelf -ldw
LIB_OBJS += $(OUTPUT)util/probe-finder.o
-endif
-endif
+ifneq ($(origin PERF_HAVE_DWARF_REGS), undefined)
+   BASIC_CFLAGS += -I/usr/include/elfutils -DPERF_HAVE_DWARF_REGS
+else
+   msg := $(warning DWARF register mappings have not been defined for 
architecture $(ARCH), DWARF support will be limited);
+endif # PERF_HAVE_DWARF_REGS
+endif # NO_DWARF
+endif # Dwarf support
 
 ifneq ($(shell sh -c "(echo '\#include '; echo 'int main(void) { 
newtInit(); newtCls(); return newtFinished(); }') | $(CC) -x c - $(ALL_CFLAGS) 
-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -lnewt -o $(BITBUCKET) 
$(ALL_LDFLAGS) $(EXTLIBS) "$(QUIET_STDERR)" && echo y"), y)
msg := $(warning newt not found, disables TUI support. Please install 
newt-devel or libnewt-dev);
diff --git a/tools/perf/arch/x86/Makefile b/tools/perf/arch/x86/Makefile
new file mode 100644
index 000..cbd7833
--- /dev/null
+++ b/tools/perf/arch/x86/Makefile
@@ -0,0 +1,2 @@
+PERF_HAVE_DWARF_REGS := 1
+LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/dwarf-regs.c 
b/tools/perf/arch/x86/util/dwarf-regs.c
new file mode 100644
index 000..a794d30
--- /dev/null
+++ b/tools/perf/arch/x86/util/dwarf-regs.c
@@ -0,0 +1,75 @@
+/*
+ * dwarf-regs.c : Mapping of DWARF debug register numbers into register names.
+ * Extracted from probe-finder.c
+ *
+ * Written by Masami Hiramatsu 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This pr

perf: Split out arch specific code & improve PowerPC perf probe support

2010-04-13 Thread Ian Munsie
These patches add the required mappings to use perf probe on PowerPC.

Functionality wise it requires the patch titled "powerpc: Add kprobe-based
event tracer" from the powerpc-next tree to provide the
HAVE_REGS_AND_STACK_ACCESS_API required for CONFIG_KPROBE_EVENT. The code will
still compile cleanly without it and will fail gracefully at runtime on the
missing CONFIG_KPROBE_EVENT support as before.

Part 1 of the patch series moves the arch dependent x86 32 and 64 bit DWARF
register number mappings out into a separate arch directory and adds the
necessary Makefile foo to use it.

Part 2 of the patch series adds the PowerPC mappings.


Changes since v1: From Masami Hiramatsu's suggestion, I added a check in the
Makefile for if the arch specific Makefile defines PERF_HAVE_DWARF_REGS,
printing a message during build if it has not. This simplifies the code
removing the odd macro from the previous version and the need for an arch
specific arch_dwarf-regs.h. I have not entirely disabled DWARF support for
architectures that don't implement the register mappings, so that they can
still add a probe based on a line number (they will be missing the ability to
capture the value of a variable from a register).


Thanks,
-Ian




Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

2010-04-13 Thread Michael Neuling
In message <1271161766.4807.1280.ca...@twins> you wrote:
> On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > When calculating capacity we use the following calculation:
> > 
> >    capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> > 
> > In SMT2, power will be 1178/2 (provided we are not scaling power with
> > freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
> > being 1.
> > 
> > With SMT4 however, power will be 1178/4, hence capacity will end up as
> > 0.
> > 
> > Fix this by ensuring capacity is always at least 1 after this
> > calculation.
> > 
> > Signed-off-by: Michael Neuling 
> > ---
> > I'm not sure this is the correct fix but this works for me.
> 
> Right, so I suspect this will indeed break some things.
> 
> We initially allowed 0 capacity for when a cpu is consumed by an RT task
> and there simply isn't much capacity left, in that case you really want
> to try and move load to your sibling cpus if possible.

Changing the CPU power based on what tasks are running on them seems a
bit wrong to me.  Shouldn't we keep those concepts separate?

> However you're right that this goes awry in your case.
> 
> One thing to look at is if that 15% increase is indeed representative
> for the power7 cpu, it having 4 SMT threads seems to suggest there were
> significant gains, otherwise they'd not have wasted the silicon.

There are certainly, for most workloads, per core gains for SMT4 over
SMT2 on P7.  My kernels certainly compile faster and that's the only
workload anyone who matters cares about ;-)

> (The broken x86 code was meant to actually compute the SMT gain, so that
> we'd not have to guess the 15%)
> 
> Now, increasing this will only marginally fix the issue, since if you
> end up with 512 per thread it only takes a very tiny amount of RT
> workload to drop below and end up at 0 again.

I tried initially to make smt_gain programmable and at 2048 per core
(512 per thread), the packing became unstable, as you intimated.

> One thing we could look at is using the cpu base power to compute
> capacity from. We'd have to add another field to sched_group and store
> power before we do the scale_rt_power() stuff.

Separating capacity from what RT tasks are running seems like a good
idea to me.

This would fix the RT issue, but it's not clear to me how you are
suggesting fixing the rounding down to 0 SMT4 issue.  Are you suggesting
we bump smt_gain to say 2048 + 15%?  Or are you suggesting we separate
the RT tasks out from capacity and keep the max(1, capacity) that I've
added?  Or something else?

Would another possibility be changing capacity to a scaled value (like
cpu_power is now) rather than a small integer as it is now.  For
example, a scaled capacity of 1024 would be equivalent to a capacity of
1 now.  This might enable us to handle partial capacities better?  We'd
probably have to scale a bunch of nr_running too.  
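
To make the numbers concrete (purely illustrative standalone arithmetic,
nothing to do with the actual scheduler code paths):

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL
#define DIV_ROUND_CLOSEST(x, d)	(((x) + ((d) / 2)) / (d))

int main(void)
{
	unsigned long smt2 = 1178 / 2;	/* per-thread power in SMT2 */
	unsigned long smt4 = 1178 / 4;	/* per-thread power in SMT4 */

	/* prints 1 */
	printf("SMT2 capacity = %lu\n", DIV_ROUND_CLOSEST(smt2, SCHED_LOAD_SCALE));
	/* prints 0 today; a 1024-scaled capacity would instead carry ~294 forward */
	printf("SMT4 capacity = %lu\n", DIV_ROUND_CLOSEST(smt4, SCHED_LOAD_SCALE));
	return 0;
}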

Mikey



Re: [Patch 1/1] PPC64-HWBKPT: Implement hw-breakpoints for PPC64

2010-04-13 Thread K.Prasad
On Wed, Apr 07, 2010 at 06:03:31PM +1000, Benjamin Herrenschmidt wrote:
> Ok so too many problems with your last patch, I didn't have time to fix
> them all, so it's not going into -next this week.
> 
> Please, test with a variety of defconfigs (iseries, cell, g5 for
> example), and especially with CONFIG_PERF_EVENTS not set. There are
> issues in the generic header for that (though I'm told some people are
> working on a fix).
> 
> Basically, we need something like CONFIG_HW_BREAKPOINTS that is set
> (maybe optionally, maybe not) with CONFIG_PERF_EVENTS and
> CONFIG_HAVE_HW_BREAKPOINT. Then you can use that to properly fix the
> ifdef'ing in include/linux/hw_breakpoints.h, and fix the various other
> cases in powerpc code where you are testing for the wrong thing.
> 
> Cheers,
> Ben.
>

Hi Ben,
I've sent a new patch (linuxppc-dev message-id ref:
20100414034340.ga6...@in.ibm.com) that builds against the defconfigs for
various architectures pointed out by you (I did see quite a few errors
that aren't induced by the patch).

The source tree is buildable even without CONFIG_PERF_EVENTS, and the
new code's scope is limited using CONFIG_HAVE_HW_BREAKPOINT. At this
stage, I did not find a need for a separate CONFIG_HW_BREAKPOINTS though.

Let me know what you think.

Thanks,
K.Prasad



[Patch 2/2] PPC64-HWBKPT: Implement hw-breakpoints for PPC64

2010-04-13 Thread K.Prasad
Implement perf-events based hw-breakpoint interfaces for PPC64 processors.
These interfaces help arbitrate requests from various users and schedule
them as appropriate.
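
For context, a kernel-side user of these interfaces looks roughly like the
generic samples/hw_breakpoint code (illustrative only, not part of this
patch; "pid_max" and my_handler are placeholders):

	struct perf_event_attr attr;
	struct perf_event **hbp;

	hw_breakpoint_init(&attr);
	attr.bp_addr = kallsyms_lookup_name("pid_max");
	attr.bp_len  = HW_BREAKPOINT_LEN_8;	/* DABR watches a doubleword */
	attr.bp_type = HW_BREAKPOINT_W | HW_BREAKPOINT_R;

	hbp = register_wide_hw_breakpoint(&attr, my_handler);
	if (IS_ERR(hbp))
		return PTR_ERR(hbp);
	/* ... */
	unregister_wide_hw_breakpoint(hbp);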

Signed-off-by: K.Prasad 
---
 arch/powerpc/Kconfig |1 
 arch/powerpc/include/asm/cputable.h  |4 
 arch/powerpc/include/asm/hw_breakpoint.h |   45 
 arch/powerpc/include/asm/processor.h |8 
 arch/powerpc/kernel/Makefile |1 
 arch/powerpc/kernel/hw_breakpoint.c  |  324 +++
 arch/powerpc/kernel/machine_kexec_64.c   |3 
 arch/powerpc/kernel/process.c|6 
 arch/powerpc/kernel/ptrace.c |   81 +++
 arch/powerpc/lib/Makefile|1 
 include/linux/hw_breakpoint.h|1 
 11 files changed, 475 insertions(+)

Index: linux-2.6.ppc64_test/arch/powerpc/include/asm/hw_breakpoint.h
===
--- /dev/null
+++ linux-2.6.ppc64_test/arch/powerpc/include/asm/hw_breakpoint.h
@@ -0,0 +1,45 @@
+#ifndef _PPC_BOOK3S_64_HW_BREAKPOINT_H
+#define _PPC_BOOK3S_64_HW_BREAKPOINT_H
+
+#ifdef __KERNEL__
+#define__ARCH_HW_BREAKPOINT_H
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+
+struct arch_hw_breakpoint {
+   u8  len; /* length of the target data symbol */
+   int type;
+   unsigned long   address;
+};
+
+#include 
+#include 
+#include 
+
+struct perf_event;
+struct pmu;
+struct perf_sample_data;
+
+#define HW_BREAKPOINT_ALIGN 0x7
+/* Maximum permissible length of any HW Breakpoint */
+#define HW_BREAKPOINT_LEN 0x8
+
+extern int arch_validate_hwbkpt_settings(struct perf_event *bp,
+   struct task_struct *tsk);
+extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
+   unsigned long val, void *data);
+int arch_install_hw_breakpoint(struct perf_event *bp);
+void arch_uninstall_hw_breakpoint(struct perf_event *bp);
+void hw_breakpoint_pmu_read(struct perf_event *bp);
+extern void flush_ptrace_hw_breakpoint(struct task_struct *tsk);
+
+extern struct pmu perf_ops_bp;
+extern void ptrace_triggered(struct perf_event *bp, int nmi,
+   struct perf_sample_data *data, struct pt_regs *regs);
+static inline void hw_breakpoint_disable(void)
+{
+   set_dabr(0);
+}
+
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
+#endif /* __KERNEL__ */
+#endif /* _PPC_BOOK3S_64_HW_BREAKPOINT_H */
Index: linux-2.6.ppc64_test/arch/powerpc/kernel/hw_breakpoint.c
===
--- /dev/null
+++ linux-2.6.ppc64_test/arch/powerpc/kernel/hw_breakpoint.c
@@ -0,0 +1,324 @@
+/*
+ * HW_breakpoint: a unified kernel/user-space hardware breakpoint facility,
+ * using the CPU's debug registers. Derived from
+ * "arch/x86/kernel/hw_breakpoint.c"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright 2009 IBM Corporation
+ * Author: K.Prasad 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+/*
+ * Store the 'bp' that caused the hw-breakpoint exception just before we
+ * single-step. Use it to distinguish a single-step exception (due to a
+ * previous hw-breakpoint exception) from a normal one
+ */
+static DEFINE_PER_CPU(struct perf_event *, last_hit_bp);
+
+/*
+ * Stores the breakpoints currently in use on each breakpoint address
+ * register for every cpu
+ */
+static DEFINE_PER_CPU(struct perf_event *, bp_per_reg);
+
+/*
+ * Install a perf counter breakpoint.
+ *
+ * We seek a free debug address register and use it for this
+ * breakpoint.
+ *
+ * Atomic: we hold the counter->ctx->lock and we only handle variables
+ * and registers local to this cpu.
+ */
+int arch_install_hw_breakpoint(struct perf_event *bp)
+{
+   struct arch_hw_breakpoint *info = counter_arch_bp(bp);
+   struct perf_event **slot = &__get_cpu_var(bp_per_reg);
+
+   *slot = bp;
+   set_dabr(info->address | info->type | DABR_TRANSLATION);
+   return 0;
+}
+
+/*
+ * Uninstall the breakpoint contained in the given counter.
+ *
+ * First we search the debug address register it uses and then we disable
+ * it.
+ *
+ * Atomic: we hold the

[Patch 1/2] PPC64-HWBKPT: Disable interrupts for data breakpoint exceptions

2010-04-13 Thread K.Prasad
Data address breakpoint exceptions are currently handled along with page
faults, which require interrupts to remain enabled. Since exception handling
for data breakpoints isn't preempt-safe, we handle them separately.

Signed-off-by: K.Prasad 
---
 arch/powerpc/kernel/exceptions-64s.S |   13 -
 arch/powerpc/mm/fault.c  |5 +++--
 2 files changed, 15 insertions(+), 3 deletions(-)

Index: linux-2.6.ppc64_test/arch/powerpc/kernel/exceptions-64s.S
===
--- linux-2.6.ppc64_test.orig/arch/powerpc/kernel/exceptions-64s.S
+++ linux-2.6.ppc64_test/arch/powerpc/kernel/exceptions-64s.S
@@ -735,8 +735,11 @@ _STATIC(do_hash_page)
std r3,_DAR(r1)
std r4,_DSISR(r1)
 
-   andis.  r0,r4,0xa450/* weird error? */
+   andis.  r0,r4,0xa410/* weird error? */
bne-handle_page_fault   /* if not, try to insert a HPTE */
+   andis.  r0,r4,DSISR_DABRMATCH@h
+   bne-handle_dabr_fault
+
 BEGIN_FTR_SECTION
andis.  r0,r4,0x0020/* Is it a segment table fault? */
bne-do_ste_alloc/* If so handle it */
@@ -823,6 +826,14 @@ END_FW_FTR_SECTION_IFCLR(FW_FEATURE_ISER
bl  .raw_local_irq_restore
b   11f
 
+/* We have a data breakpoint exception - handle it */
+handle_dabr_fault:
+   ld  r4,_DAR(r1)
+   ld  r5,_DSISR(r1)
+   addir3,r1,STACK_FRAME_OVERHEAD
+   bl  .do_dabr
+   b   .ret_from_except_lite
+
 /* Here we have a page fault that hash_page can't handle. */
 handle_page_fault:
ENABLE_INTS
Index: linux-2.6.ppc64_test/arch/powerpc/mm/fault.c
===
--- linux-2.6.ppc64_test.orig/arch/powerpc/mm/fault.c
+++ linux-2.6.ppc64_test/arch/powerpc/mm/fault.c
@@ -151,13 +151,14 @@ int __kprobes do_page_fault(struct pt_re
if (!user_mode(regs) && (address >= TASK_SIZE))
return SIGSEGV;
 
-#if !(defined(CONFIG_4xx) || defined(CONFIG_BOOKE))
+#if !(defined(CONFIG_4xx) || defined(CONFIG_BOOKE) || \
+defined(CONFIG_PPC_BOOK3S_64))
if (error_code & DSISR_DABRMATCH) {
/* DABR match */
do_dabr(regs, address, error_code);
return 0;
}
-#endif /* !(CONFIG_4xx || CONFIG_BOOKE)*/
+#endif
 
if (in_atomic() || mm == NULL) {
if (!user_mode(regs))



[Patch 0/2] PPC64-HWBKPT: Hardware Breakpoint interfaces - ver XVII

2010-04-13 Thread K.Prasad
Hi Ben,
Please find a new version of the patchset. It contains a small, yet
significant, number of changes to address the build issues you pointed out
(details of which are listed in the changelog below).

Changelog - ver XVII

(Version XVI: linuxppc-dev ref: 20100330095809.ga14...@in.ibm.com)
- CONFIG_HAVE_HW_BREAKPOINT is now used to define the scope of the new code
  (in lieu of CONFIG_PPC_BOOK3S_64).
- CONFIG_HAVE_HW_BREAKPOINT is now dependent upon CONFIG_PERF_EVENTS and
  CONFIG_PPC_BOOK3S_64 (to overcome build failures under certain configs).
- Included a target in arch/powerpc/lib/Makefile to build sstep.o when
  HAVE_HW_BREAKPOINT.
- Added a dummy definition for hw_breakpoint_disable() when !HAVE_HW_BREAKPOINT.
- Tested builds under defconfigs for ppc64, cell and g5. Found no patch
  induced failures.

Kindly accept them to be a part of -next tree.

Thanks,
K.Prasad


Changelog - ver XVI

(Version XV: linuxppc-dev ref: 20100323140639.ga21...@in.ibm.com)
- Used a new config option CONFIG_PPC_BOOK3S_64 (in lieu of
  CONFIG_PPC64/CPU_FTR_HAS_DABR) to limit the scope of the new code.
- Disabled breakpoints before kexec of the machine using 
hw_breakpoint_disable().
- Minor optimisation in exceptions-64s.S to check for data breakpoint exceptions
  in DSISR last (after checking for other causes), plus changes in code comments
  and the representation of the DSISR_DABRMATCH constant.
- Rebased to commit ae6be51ed01d6c4aaf249a207b4434bc7785853b of linux-2.6.

Changelog - ver XV

(Version XIV: linuxppc-dev ref: 20100308181232.ga3...@in.ibm.com)

- Additional patch to disable interrupts during data breakpoint exception
  handling.
- Moved HBP_NUM definition to cputable.h under a new CPU_FTR definition
  (CPU_FTR_HAS_DABR).
- Filtering of extraneous exceptions (due to accesses outside symbol length) is
  by-passed for ptrace requests.
- Removed flush_ptrace_hw_breakpoint() from __switch_to() due to incorrect
  coding placement.
- Changes to code comments as per community reviews for previous version.
- Minor coding-style changes in hw_breakpoint.c as per review comments.
- Re-based to commit ae6be51ed01d6c4aaf249a207b4434bc7785853b of linux-2.6

Changelog - ver XIV

(Version XIII: linuxppc-dev ref: 20100215055605.gb3...@in.ibm.com)

- Removed the 'name' field from 'struct arch_hw_breakpoint'.
- All callback invocations through bp->overflow_handler() are replaced with
  perf_bp_event().
- Removed the check for pre-existing single-stepping mode in
  hw_breakpoint_handler() as this check is unreliable while in kernel-space.
  Side effect of this change is the non-triggering of hw-breakpoints while
  single-stepping kernel through KGDB or Xmon.
- Minor code-cleanups and addition of comments in hw_breakpoint_handler() and
  single_step_dabr_instruction().
- Re-based to commit 25cf84cf377c0aae5dbcf937ea89bc7893db5176 of linux-2.6

Changelog - ver XIII

(Version XII: linuxppc-dev ref: 20100121084640.ga3...@in.ibm.com)

- Fixed a bug for user-space breakpoints (triggered following the failure of a
  breakpoint request).
- Re-based on commit 724e6d3fe8003c3f60bf404bf22e4e331327c596 of linux-2.6
  
Changelog - ver XII

(Version XI: linuxppc-dev ref: 20100119091234.ga9...@in.ibm.com)

- Unset MSR_SE only if kernel was not previously in single-step mode.
- Pre-emption is now enabled before returning from the hw-breakpoint exception
  handler.
- Variables to track the source of single-step exception (breakpoint from 
kernel,
  user-space vs single-stepping due to other requests) are added.
- Extraneous hw-breakpoint exceptions (due to memory accesses lying outside
  monitored symbol length) is now done for both kernel and user-space
  (previously only user-space).
- single_step_dabr_instruction() now returns NOTIFY_DONE if kernel was in
  single-step mode even before the hw-breakpoint. This enables other users of
  single-step mode to be notified of the exception.
- User-space instructions are not emulated from kernel-space, they are instead
  single-stepped.
  
Changelog - ver XI
--
(Version X: linuxppc-dev ref: 20091211160144.ga23...@in.ibm.com)
- Conditionally unset MSR_SE in the single-step handler
- Added comments to explain the duration and need for pre-emption
disable following hw-breakpoint exception.

Changelog - ver X
--
- Re-write the PPC64 patches for the new implementation of hw-breakpoints that
  uses the perf-layer.
- Rebased to commit 7622fc234190a37d4e9fe3ed944a2b61a63fca03 of -tip.

Changelog - ver IX
---
- Invocation of user-defined callback will be 'trigger-after-execute' (except
  for ptrace).
- Creation of a new global per-CPU breakpoint structure to help invocation of
  user-defined callback from single-step handler.
(Changes made now)
- Validation before registration will fail only if the address does not match
  the kernel symbol's (if s

Re: Possible bug with mutex adaptative spinning

2010-04-13 Thread Benjamin Herrenschmidt
On Wed, 2010-04-14 at 12:35 +1000, Benjamin Herrenschmidt wrote:
> Hi Peter !
> 
> I -may- have found a bug with mutex adaptative spinning. We hit it when
> torture testing CPU unplug.

 .../...

In fact, I wonder if there's another potential problem:

If the owner is actually running, it may do so for a very long time. It
looks to me that everybody trying to take the mutex will thus spin and
never get out of the spin loop until the owner stops running.

That sounds very wrong to me :-)

Shouldn't you at least remove the test for !owner when testing
need_resched() in __mutex_lock_common() ?
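
For reference, the spin loop in question looks roughly like this
(paraphrased from memory, details trimmed):

	for (;;) {
		struct thread_info *owner;

		owner = ACCESS_ONCE(lock->owner);
		if (owner && !mutex_spin_on_owner(lock, owner))
			break;

		if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
			return 0;	/* got the lock */

		/* need_resched() only breaks the loop when there is no owner */
		if (!owner && (need_resched() || rt_task(task)))
			break;

		cpu_relax();
	}

So with a valid, running owner we never even look at need_resched().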

Cheers,
Ben.



Possible bug with mutex adaptative spinning

2010-04-13 Thread Benjamin Herrenschmidt
Hi Peter !

I -may- have found a bug with mutex adaptative spinning. We hit it when
torture testing CPU unplug.

Basically, what happens is mutex_spin_on_owner() returns 1 if the owner
CPU is offline. That means that the caller (__mutex_lock_common()) will
spin until the mutex is released since there's a valid owner (so the
need_resched() test doesn't trigger). This will deadlock if this is the
only remaining CPU online.

In fact, it even deadlocks with more than 1 CPU online, for example, I
have a case where the 2 remaining online CPUs got into
__mutex_lock_common() at the same time, and so got into that spin loop
before any of them added itself to the wait_list, thus never hitting the
exit condition there.

I believe your test against nr_cpumask_bits is also a bit fragile for
similar reasons. IE. You have what looks like a valid owner but it seems
to be on an invalid CPU. It could be a freed thread_info, in which case
returning 1 is fine, but if it's anything else, kaboom. I think it's
better to be safe than sorry here and just go to sleep (ie return 0).

Same comment with the DEBUG_PAGE_ALLOC case. In fact, this (untested)
patch makes the function simpler and I don't think will have any
negative effect on performances. Let me know what you think:

 cut here 

[PATCH] mutex: Don't spin when the owner CPU is offline or other weird cases

The current code might spin forever if the CPU owning the mutex has been
offlined, and the last CPU in the system is trying to acquire it, since
mutex_spin_on_owner() will always return 1, telling the caller to spin
until the mutex has been released.

This patch changes mutex_spin_on_owner() to return 0 (don't spin) in any
case where we aren't sure about the owner struct validity or CPU number,
and if the said CPU is offline. There is no point going back &
re-evaluate spinning in corner cases like that, let's just go to sleep.

Signed-off-by: Benjamin Herrenschmidt 
---

diff --git a/kernel/sched.c b/kernel/sched.c
index a3dff1f..11b7f4a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3780,7 +3780,7 @@ int mutex_spin_on_owner(struct mutex *lock, struct
thread_info *owner)
 * the mutex owner just released it and exited.
 */
if (probe_kernel_address(&owner->cpu, cpu))
-   goto out;
+   return 0;
 #else
cpu = owner->cpu;
 #endif
@@ -3788,16 +3788,21 @@ int mutex_spin_on_owner(struct mutex *lock,
struct thread_info *owner)
/*
 * Even if the access succeeded (likely case),
 * the cpu field may no longer be valid.
+*
+* We stop spinning in this case since the owner may never
 */
if (cpu >= nr_cpumask_bits)
-   goto out;
+   return 0;
 
/*
 * We need to validate that we can do a
 * get_cpu() and that we have the percpu area.
+*
+* If the CPU is offline, the owner will never release it so
+* we must not spin
 */
if (!cpu_online(cpu))
-   goto out;
+   return 0;
 
rq = cpu_rq(cpu);
 
@@ -3816,7 +3821,6 @@ int mutex_spin_on_owner(struct mutex *lock, struct
thread_info *owner)
 
cpu_relax();
}
-out:
return 1;
 }
 #endif




Re: [PATCH 5/5] sched: make fix_small_imbalance work with asymmetric packing

2010-04-13 Thread Suresh Siddha
On Tue, 2010-04-13 at 05:29 -0700, Peter Zijlstra wrote:
> On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > With the asymmetric packing infrastructure, fix_small_imbalance is
> > causing idle higher threads to pull tasks off lower threads.  
> > 
> > This is being caused by an off-by-one error.  
> > 
> > Signed-off-by: Michael Neuling 
> > ---
> > I'm not sure this is the right fix but without it, higher threads pull
> > tasks off the lower threads, then the packing pulls it back down, etc
> > etc and tasks bounce around constantly.
> 
> Would help if you expand upon the why/how it manages to get pulled up.
> 
> I can't immediately spot anything wrong with the patch, but then that
> isn't my favourite piece of code either.. Suresh, any comments?
> 

Sorry didn't pay much attention to this patchset. But based on the
comments from Michael and looking at this patchset, it has SMT/MC
implications. I will review and run some tests and get back in a day.

As far as this particular patch is concerned, the original code comes
from Ingo's original CFS code commit (dd41f596) and the below hunk
pretty much explains what the change is about.

-   if (max_load - this_load >= busiest_load_per_task * imbn) {
+   if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
+   busiest_load_per_task * imbn) {

So the below proposed change will probably break what the above
mentioned commit was trying to achieve, which is: for fairness reasons
we were bouncing the small extra load (between the max_load and
this_load) around.

> > ---
> > 
> >  kernel/sched_fair.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > Index: linux-2.6-ozlabs/kernel/sched_fair.c
> > ===
> > --- linux-2.6-ozlabs.orig/kernel/sched_fair.c
> > +++ linux-2.6-ozlabs/kernel/sched_fair.c
> > @@ -2652,7 +2652,7 @@ static inline void fix_small_imbalance(s
> >  * SCHED_LOAD_SCALE;
> > scaled_busy_load_per_task /= sds->busiest->cpu_power;
> >  
> > -   if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
> > +   if (sds->max_load - sds->this_load + scaled_busy_load_per_task >
> > (scaled_busy_load_per_task * imbn)) {
> > *imbalance = sds->busiest_load_per_task;
> > return;
> 

thanks,
suresh



Re: [PATCH v2] powerpc: Track backing pages allocated by vmemmap_populate()

2010-04-13 Thread Mark Nelson
On Tuesday 13 April 2010 21:16:44 Benjamin Herrenschmidt wrote:
> On Tue, 2010-04-13 at 16:02 +1000, Mark Nelson wrote:
> > That's a good question, and one that I probably should have added to
> > the
> > commit message.
> > But, following through, it looks like we end up calling into
> > __remove_section() from mm/memory_hotplug.c and if
> > CONFIG_SPARSEMEM_VMEMMAP is enabled we just return EBUSY as freeing
> > memmap with vmemmap isn't implemented yet.
> > 
> > So for the moment, I'm not sure we have to worry about it.
> 
> We probably don't. IE. The vmemmap will remain for those struct pages,
> which means they won't need to be allocated again if some memory is
> plugged back there. If not, we just waste a bit of memory. Not a big
> deal.
> 
Excellent! That makes sense.

Thanks Ben!

Mark.


RE: [PATCH v3 3/6] RapidIO: Add Port-Write handling for EM

2010-04-13 Thread Bounine, Alexandre
Andrew Morton wrote:
> On Tue, 6 Apr 2010 17:22:38 -0400 Alexandre Bounine
 wrote:
> 
> >
> > From: Alexandre Bounine 
> >
> > Add RapidIO Port-Write message handling in the context
> > of Error Management Extensions Specification Rev.1.3.
> >
> > ...
> >
> > +static struct rio_dev *rio_get_comptag(u32 comp_tag, struct rio_dev
*from)
> > +{
> > +   struct list_head *n;
> > +   struct rio_dev *rdev;
> > +
> > +   WARN_ON(in_interrupt());
> 
> The check should be unneeded - lockdep will warn about this.

Thank you. I will remove it in my next set of patches. 

> 
> > +   spin_lock(&rio_global_list_lock);
> > +   n = from ? from->global_list.next : rio_devices.next;
> > +
> > +   while (n && (n != &rio_devices)) {
> > +   rdev = rio_dev_g(n);
> > +   if (rdev->comp_tag == comp_tag)
> > +   goto exit;
> > +   n = n->next;
> > +   }
> > +   rdev = NULL;
> > +exit:
> > +   spin_unlock(&rio_global_list_lock);
> > +   return rdev;
> > +}
> > +



RE: [PATCH 2/2] [V5] Add non-Virtex5 support for LL TEMAC driver

2010-04-13 Thread John Linn
> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Tuesday, April 13, 2010 2:34 AM
> To: grant.lik...@secretlab.ca
> Cc: John Linn; net...@vger.kernel.org; linuxppc-...@ozlabs.org;
jwbo...@linux.vnet.ibm.com;
> eric.duma...@gmail.com; john.willi...@petalogix.com;
michal.si...@petalogix.com; jty...@cs.ucr.edu
> Subject: Re: [PATCH 2/2] [V5] Add non-Virtex5 support for LL TEMAC
driver
> 
> From: Grant Likely 
> Date: Fri, 9 Apr 2010 12:10:21 -0
> 
> > On Thu, Apr 8, 2010 at 11:08 AM, John Linn 
wrote:
> >> This patch adds support for using the LL TEMAC Ethernet driver on
> >> non-Virtex 5 platforms by adding support for accessing the Soft DMA
> >> registers as if they were memory mapped instead of solely through
the
> >> DCR's (available on the Virtex 5).
> >>
> >> The patch also updates the driver so that it runs on the
MicroBlaze.
> >> The changes were tested on the PowerPC 440, PowerPC 405, and the
> >> MicroBlaze platforms.
> >>
> >> Signed-off-by: John Tyner 
> >> Signed-off-by: John Linn 
> >
> > Picked up and build tested both patches on 405, 440, 60x and ppc64.
> > No build problems found either built-in or as a module.
> >
> > for both:
> > Acked-by: Grant Likely 
> 
> Ok, both applied to net-next-2.6, thanks everyone for sorting this
> out.

Great! Thanks David, appreciate the help.





RE: Upgrade to 2.6.26 results in Oops during request_irq

2010-04-13 Thread Sparks, Sam
>From: Sparks, Sam
>Sent: Thursday, April 08, 2010 4:15 PM
>
>Howdy All,
>
>I have (almost) successfully upgraded from Linux 2.6.22 to 2.6.26 (both
>downloaded from debian) on my mpc8347 powerpc, but I think I may be
>missing a required change to my dts regarding the IPIC or maybe a
change
>in how my driver requests IRQs.
>
>I have several modules that are maintained outside the kernel tree, and
>all but one is working correctly. The offending driver is attempting to
>register IRQ 23, and is accessing an invalid memory location. I am
>currently getting the following stack dump:

Aha! I have found the change that caused my module to break:
http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg09748.html

This patch modified ipic_set_irq_type to override the handle_irq
function pointer.

Do I need to register a new function to handle falling edge triggered
external interrupts? It appears the default is NULL.
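
For example, should the driver now be setting the sense explicitly before
requesting the IRQ, something like this (illustrative only, untested)?

	set_irq_type(virq, IRQ_TYPE_EDGE_FALLING);
	ret = request_irq(virq, my_isr, 0, "mydev", dev);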

--Sam




Re: [PATCH 2/5] sched: add asymmetric packing option for sibling domain

2010-04-13 Thread Peter Zijlstra
On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> Peter: Since this is based mainly off your initial patch, it should
> have your signed-off-by too, but I didn't want to add without your
> permission.  Can I add it?

Of course! :-)

This thing does need a better changelog though, and maybe a larger
comment with check_asym_packing(), explaining why and what we're doing
and what we're assuming (that lower cpu number also means lower thread
number).


Re: [PATCH 4/5] sched: Mark the balance type for use in need_active_balance()

2010-04-13 Thread Peter Zijlstra
On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> need_active_balance() gates the asymmetric packing due to power
> save logic, but for packing we don't care.

This explanation lacks a how/why.

So the problem is that need_active_balance() ends up returning false and
prevents the active balance from pulling a task to a lower available SMT
sibling?

> This marks the type of balance we are attempting to perform from
> f_b_g() and stops need_active_balance() power save logic gating a
> balance in the asymmetric packing case.

At the very least this wants more comments in the code. I'm not really
charmed by having to add yet another variable to pass around that mess,
but I can't seem to come up with something cleaner either.

> Signed-off-by: Michael Neuling 
> 
> ---
> 
>  kernel/sched_fair.c |   24 +++-
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6-ozlabs/kernel/sched_fair.c
> ===
> --- linux-2.6-ozlabs.orig/kernel/sched_fair.c
> +++ linux-2.6-ozlabs/kernel/sched_fair.c
> @@ -91,6 +91,13 @@ const_debug unsigned int sysctl_sched_mi
>  
>  static const struct sched_class fair_sched_class;
>  
> +enum balance_type {
> + BALANCE_NONE = 0,
> + BALANCE_LOAD,
> + BALANCE_POWER,
> + BALANCE_PACKING
> +};
> +
>  /**
>   * CFS operations on generic schedulable entities:
>   */
> @@ -2783,7 +2790,8 @@ static inline void calculate_imbalance(s
>  static struct sched_group *
>  find_busiest_group(struct sched_domain *sd, int this_cpu,
>  unsigned long *imbalance, enum cpu_idle_type idle,
> -int *sd_idle, const struct cpumask *cpus, int *balance)
> +int *sd_idle, const struct cpumask *cpus, int *balance,
> +enum balance_type *bt)
>  {
>   struct sd_lb_stats sds;
>  
> @@ -2808,6 +2816,7 @@ find_busiest_group(struct sched_domain *
>   if (!(*balance))
>   goto ret;
>  
> + *bt = BALANCE_PACKING;
>   if ((idle == CPU_IDLE || idle == CPU_NEWLY_IDLE) &&
>   check_asym_packing(sd, &sds, this_cpu, imbalance))
>   return sds.busiest;
> @@ -2828,6 +2837,7 @@ find_busiest_group(struct sched_domain *
>  
>   /* Looks like there is an imbalance. Compute it */
>   calculate_imbalance(&sds, this_cpu, imbalance);
> + *bt = BALANCE_LOAD;
>   return sds.busiest;
>  
>  out_balanced:
> @@ -2835,10 +2845,12 @@ out_balanced:
>* There is no obvious imbalance. But check if we can do some balancing
>* to save power.
>*/
> + *bt = BALANCE_POWER;
>   if (check_power_save_busiest_group(&sds, this_cpu, imbalance))
>   return sds.busiest;
>  ret:
>   *imbalance = 0;
> + *bt = BALANCE_NONE;
>   return NULL;
>  }
>  
> @@ -2899,9 +2911,10 @@ find_busiest_queue(struct sched_group *g
>  /* Working cpumask for load_balance and load_balance_newidle. */
>  static DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
>  
> -static int need_active_balance(struct sched_domain *sd, int sd_idle, int 
> idle)
> +static int need_active_balance(struct sched_domain *sd, int sd_idle, int 
> idle,
> +enum balance_type *bt)
>  {
> - if (idle == CPU_NEWLY_IDLE) {
> + if (idle == CPU_NEWLY_IDLE && *bt != BALANCE_PACKING) {
>   /*
>* The only task running in a non-idle cpu can be moved to this
>* cpu in an attempt to completely freeup the other CPU
> @@ -2946,6 +2959,7 @@ static int load_balance(int this_cpu, st
>   struct rq *busiest;
>   unsigned long flags;
>   struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
> + enum balance_type bt;
>  
>   cpumask_copy(cpus, cpu_active_mask);
>  
> @@ -2964,7 +2978,7 @@ static int load_balance(int this_cpu, st
>  redo:
>   update_shares(sd);
>   group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
> -cpus, balance);
> +cpus, balance, &bt);
>  
>   if (*balance == 0)
>   goto out_balanced;
> @@ -3018,7 +3032,7 @@ redo:
>   schedstat_inc(sd, lb_failed[idle]);
>   sd->nr_balance_failed++;
>  
> - if (need_active_balance(sd, sd_idle, idle)) {
> + if (need_active_balance(sd, sd_idle, idle, &bt)) {
>   raw_spin_lock_irqsave(&busiest->lock, flags);
>  
>   /* don't kick the migration_thread, if the curr



Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

2010-04-13 Thread Peter Zijlstra
On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> When calculating capacity we use the following calculation:
> 
>capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> 
> In SMT2, power will be 1178/2 (provided we are not scaling power with
> freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
> being 1.
> 
> With SMT4 however, power will be 1178/4, hence capacity will end up as
> 0.
> 
> Fix this by ensuring capacity is always at least 1 after this
> calculation.
> 
> Signed-off-by: Michael Neuling 
> ---
> I'm not sure this is the correct fix but this works for me.  

Right, so I suspect this will indeed break some things.

We initially allowed 0 capacity for when a cpu is consumed by an RT task
and there simply isn't much capacity left, in that case you really want
to try and move load to your sibling cpus if possible.

However you're right that this goes awry in your case.

One thing to look at is if that 15% increase is indeed representative
for the power7 cpu, it having 4 SMT threads seems to suggest there were
significant gains, otherwise they'd not have wasted the silicon.

(The broken x86 code was meant to actually compute the SMT gain, so that
we'd not have to guess the 15%)

Now, increasing this will only marginally fix the issue, since if you
end up with 512 per thread it only takes a very tiny amount of RT
workload to drop below and end up at 0 again.

One thing we could look at is using the cpu base power to compute
capacity from. We'd have to add another field to sched_group and store
power before we do the scale_rt_power() stuff.
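
Something like this, perhaps (untested sketch from memory; cpu_power_orig
is a made-up field name):

	/* sched_group grows: unsigned int cpu_power_orig; */

	/* in update_cpu_power(), remember the pre-RT-scaling value */
	sdg->cpu_power_orig = power;
	power *= scale_rt_power(cpu);
	power >>= SCHED_LOAD_SHIFT;
	sdg->cpu_power = power;

	/* ...and in update_sg_lb_stats(), derive capacity from it */
	sgs->group_capacity =
		DIV_ROUND_CLOSEST(group->cpu_power_orig, SCHED_LOAD_SCALE);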

Thoughts?


>  kernel/sched_fair.c |6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6-ozlabs/kernel/sched_fair.c
> ===
> --- linux-2.6-ozlabs.orig/kernel/sched_fair.c
> +++ linux-2.6-ozlabs/kernel/sched_fair.c
> @@ -1482,6 +1482,7 @@ static int select_task_rq_fair(struct ta
>   }
>  
>   capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> + capacity = max(capacity, 1UL);
>  
>   if (tmp->flags & SD_POWERSAVINGS_BALANCE)
>   nr_running /= 2;
> @@ -2488,6 +2489,7 @@ static inline void update_sg_lb_stats(st
>  
>   sgs->group_capacity =
>   DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
> + sgs->group_capacity = max(sgs->group_capacity, 1UL);
>  }
>  
>  /**
> @@ -2795,9 +2797,11 @@ find_busiest_queue(struct sched_group *g
>  
>   for_each_cpu(i, sched_group_cpus(group)) {
>   unsigned long power = power_of(i);
> - unsigned long capacity = DIV_ROUND_CLOSEST(power, 
> SCHED_LOAD_SCALE);
> + unsigned long capacity;
>   unsigned long wl;
>  
> + capacity = max(DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE), 1UL);
> +
>   if (!cpumask_test_cpu(i, cpus))
>   continue;
>  



Re: [PATCH 5/5] sched: make fix_small_imbalance work with asymmetric packing

2010-04-13 Thread Peter Zijlstra
On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> With the asymmetric packing infrastructure, fix_small_imbalance is
> causing idle higher threads to pull tasks off lower threads.  
> 
> This is being caused by an off-by-one error.  
> 
> Signed-off-by: Michael Neuling 
> ---
> I'm not sure this is the right fix but without it, higher threads pull
> tasks off the lower threads, then the packing pulls it back down, etc
> etc and tasks bounce around constantly.

Would help if you expand upon the why/how it manages to get pulled up.

I can't immediately spot anything wrong with the patch, but then that
isn't my favourite piece of code either.. Suresh, any comments?

> ---
> 
>  kernel/sched_fair.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6-ozlabs/kernel/sched_fair.c
> ===
> --- linux-2.6-ozlabs.orig/kernel/sched_fair.c
> +++ linux-2.6-ozlabs/kernel/sched_fair.c
> @@ -2652,7 +2652,7 @@ static inline void fix_small_imbalance(s
>* SCHED_LOAD_SCALE;
>   scaled_busy_load_per_task /= sds->busiest->cpu_power;
>  
> - if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
> + if (sds->max_load - sds->this_load + scaled_busy_load_per_task >
>   (scaled_busy_load_per_task * imbn)) {
>   *imbalance = sds->busiest_load_per_task;
>   return;



Re: [PATCH v2] powerpc: Track backing pages allocated by vmemmap_populate()

2010-04-13 Thread Benjamin Herrenschmidt
On Tue, 2010-04-13 at 16:02 +1000, Mark Nelson wrote:
> That's a good question, and one that I probably should have added to
> the
> commit message.
> But, following through, it looks like we end up calling into
> __remove_section() from mm/memory_hotplug.c and if
> CONFIG_SPARSEMEM_VMEMMAP is enabled we just return EBUSY as freeing
> memmap with vmemmap isn't implemented yet.
> 
> So for the moment, I'm not sure we have to worry about it.

We probably don't. IE. The vmemmap will remain for those struct pages,
which means they won't need to be allocated again if some memory is
plugged back there. If not, we just waste a bit of memory. Not a big
deal.

Cheers,
Ben.




Re: [PATCH v2 net-next 2/2] gianfar: Add hardware TX timestamping support

2010-04-13 Thread David Miller
From: Manfred Rudigier 
Date: Fri, 9 Apr 2010 11:10:35 +0200

> If a packet has the skb_shared_tx->hardware flag set the device is
> instructed to generate a TX timestamp and write it back to memory after
> the frame is transmitted. During the clean_tx_ring operation the
> timestamp will be extracted and copied into the skb_shared_hwtstamps
> struct of the skb.
> 
> TX timestamping is enabled by setting the tx_type to something else
> than HWTSTAMP_TX_OFF with the SIOCSHWTSTAMP ioctl command. It is only
> supported by eTSEC devices.
> 
> Signed-off-by: Manfred Rudigier 

Applied to net-next-2.6


Re: [PATCH v2 net-next 1/2] gianfar: Add hardware RX timestamping support

2010-04-13 Thread David Miller
From: Manfred Rudigier 
Date: Fri, 9 Apr 2010 11:10:03 +0200

> The device is configured to insert hardware timestamps into all
> received packets. The RX timestamps are extracted from the padding
> alignment bytes during the clean_rx_ring operation and copied into the
> skb_shared_hwtstamps struct of the skb. This extraction only happens if
> the rx_filter was set to something else than HWTSTAMP_FILTER_NONE with
> the SIOCSHWTSTAMP ioctl command.
> 
> Hardware timestamping is only supported for eTSEC devices. To indicate
> device support the new FSL_GIANFAR_DEV_HAS_TIMER flag was introduced.
> 
> Signed-off-by: Manfred Rudigier 

Applied to net-next-2.6


Re: [PATCH 1/2] perf: Move arch specific code into separate arch directory

2010-04-13 Thread Ian Munsie
Hi Masami,

Thanks for the feedback

Excerpts from Masami Hiramatsu's message of Mon Apr 12 22:37:03 +1000 2010:
> Could you add a check whether the get_arch_regstr() is defined
> (or dwarf-regs.h is exist) in Makefile?
> If it is not defined, we'd better drop dwarf support (so set NO_DWARF),
> because it means we haven't ported perf probe on that architecture yet. :)

I was a little reluctant to do that when I first read your message
because it felt like adding autoconf stuff into the Makefile and the
code should already fail gracefully on architectures without the
register mappings. But, since the Makefile already has some autoconf
like magic and it would make the code cleaner, I'll play with the idea a
bit more tomorrow.

> > diff --git a/tools/perf/arch/x86/include/arch_dwarf-regs.h 
> > b/tools/perf/arch/x86/include/arch_dwarf-regs.h
> > new file mode 100644
> > index 000..9e8da6a
> > --- /dev/null
> > +++ b/tools/perf/arch/x86/include/arch_dwarf-regs.h
> > @@ -0,0 +1,6 @@
> > +#ifndef _PREF_ARCH_PPC_DWARF_REGS_H
> > +#define _PREF_ARCH_PPC_DWARF_REGS_H
> 
> _PREF_ARCH_X86_DWARF_REGS_H ?

Nice catch (oops)


Cheers,
-Ian


Re: [PATCH 2/2] [V5] Add non-Virtex5 support for LL TEMAC driver

2010-04-13 Thread David Miller
From: Grant Likely 
Date: Fri, 9 Apr 2010 12:10:21 -0600

> On Thu, Apr 8, 2010 at 11:08 AM, John Linn  wrote:
>> This patch adds support for using the LL TEMAC Ethernet driver on
>> non-Virtex 5 platforms by adding support for accessing the Soft DMA
>> registers as if they were memory mapped instead of solely through the
>> DCR's (available on the Virtex 5).
>>
>> The patch also updates the driver so that it runs on the MicroBlaze.
>> The changes were tested on the PowerPC 440, PowerPC 405, and the
>> MicroBlaze platforms.
>>
>> Signed-off-by: John Tyner 
>> Signed-off-by: John Linn 
> 
> Picked up and build tested both patches on 405, 440, 60x and ppc64.
> No build problems found either built-in or as a module.
> 
> for both:
> Acked-by: Grant Likely 

Ok, both applied to net-next-2.6, thanks everyone for sorting this
out.