Re: Linux 4.9: Reported regressions as of Sunday, 2016-10-30

2016-11-01 Thread Paul Bolle
On Sun, 2016-10-30 at 14:20 +0100, Thorsten Leemhuis wrote:
> As always: Are you aware of any other regressions? Then please let me
> know (simply CC regressi...@leemhuis.info).

Do build regressions count?

Because I was trying to fix an obscure build issue in arch/mips, chose
a random configuration that should hit that issue, and promptly ran
into
https://lkml.kernel.org/r/<201610301405.k82kqqw0%25fengguang...@intel.com>
The same configuration does build under v4.8; I tested that, of course.

(Side note: I had to manually insert "25" after "%" to get this to
work. Should Intel fix its mail setup, or should lkml.kernel.org learn
to escape "%"?)

Thanks,


Paul Bolle


Re: [PATCH v2 2/3] cpuidle:powernv: Add helper function to populate powernv idle states.

2016-11-01 Thread Oliver O'Halloran
On Thu, Oct 27, 2016 at 7:35 PM, Gautham R. Shenoy
 wrote:
> From: "Gautham R. Shenoy" 
>
> In the current code for powernv_add_idle_states, there is a lot of code
> duplication while initializing an idle state in powernv_states table.
>
> Add an inline helper function to populate the powernv_states[] table for
> a given idle state. Invoke this for populating the "Nap", "Fastsleep"
> and the stop states in powernv_add_idle_states.
>
> Signed-off-by: Gautham R. Shenoy 
> ---
>  drivers/cpuidle/cpuidle-powernv.c | 82 
> +++
>  include/linux/cpuidle.h   |  1 +
>  2 files changed, 49 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> b/drivers/cpuidle/cpuidle-powernv.c
> index 7fe442c..11b22b9 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -167,6 +167,28 @@ static int powernv_cpuidle_driver_init(void)
> return 0;
>  }
>
> +static inline void add_powernv_state(int index, const char *name,
> +unsigned int flags,
> +int (*idle_fn)(struct cpuidle_device *,
> +   struct cpuidle_driver *,
> +   int),
> +unsigned int target_residency,
> +unsigned int exit_latency,
> +u64 psscr_val)
> +{
> +   strncpy(powernv_states[index].name, name, CPUIDLE_NAME_LEN);
> +   strncpy(powernv_states[index].desc, name, CPUIDLE_NAME_LEN);

If the length of the supplied name is equal to CPUIDLE_NAME_LEN then strncpy()
won't terminate the string. The least annoying fix is to memset() the whole
structure to zero and pass CPUIDLE_NAME_LEN - 1 as the size argument.
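
Something along these lines (untested sketch, not the actual patch):

	/* zero the whole entry, then copy at most CPUIDLE_NAME_LEN - 1 bytes
	 * so name/desc stay NUL-terminated even for over-long names */
	memset(&powernv_states[index], 0, sizeof(powernv_states[index]));
	strncpy(powernv_states[index].name, name, CPUIDLE_NAME_LEN - 1);
	strncpy(powernv_states[index].desc, name, CPUIDLE_NAME_LEN - 1);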

> +   powernv_states[index].flags = flags;
> +   powernv_states[index].target_residency = target_residency;
> +   powernv_states[index].exit_latency = exit_latency;
> +   powernv_states[index].enter = idle_fn;
> +
> +   if (idle_fn != stop_loop)
> +   return;

This can probably be deleted. The nap/sleep loops don't look at the
psscr setting and you've passed in a dummy value of zero anyway.

> +
> +   stop_psscr_table[index] = psscr_val;
> +}
> +
>  static int powernv_add_idle_states(void)
>  {
> struct device_node *power_mgt;
> @@ -236,6 +258,7 @@ static int powernv_add_idle_states(void)
> "ibm,cpu-idle-state-residency-ns", residency_ns, 
> dt_idle_states);
>
> for (i = 0; i < dt_idle_states; i++) {
> +   unsigned int exit_latency, target_residency;
> /*
>  * If an idle state has exit latency beyond
>  * POWERNV_THRESHOLD_LATENCY_NS then don't use it
> @@ -244,27 +267,30 @@ static int powernv_add_idle_states(void)
> if (latency_ns[i] > POWERNV_THRESHOLD_LATENCY_NS)
> continue;
>
> +   exit_latency = ((unsigned int)latency_ns[i]) / 1000;
> +   target_residency = (!rc) ? ((unsigned int)residency_ns[i]) : 
> 0;
> +   /*
> +* Firmware passes residency values in ns.
> +* cpuidle expects it in us.
> +*/
> +   target_residency /= 1000;
> +
> /*
>  * Cpuidle accepts exit_latency and target_residency in us.

This comment says the same thing as the one above.

>  * Use default target_residency values if f/w does not expose 
> it.
>  */
> if (flags[i] & OPAL_PM_NAP_ENABLED) {
> +   target_residency = 100;

Hmm, the above comment says that we should only use the default value
for target_residency if firmware hasn't provided a value. Is there a
reason we always use the hard coded value?

> /* Add NAP state */
> -   strcpy(powernv_states[nr_idle_states].name, "Nap");
> -   strcpy(powernv_states[nr_idle_states].desc, "Nap");
> -   powernv_states[nr_idle_states].flags = 0;
> -   powernv_states[nr_idle_states].target_residency = 100;
> -   powernv_states[nr_idle_states].enter = nap_loop;
> +   add_powernv_state(nr_idle_states, "Nap",
> + CPUIDLE_FLAG_NONE, nap_loop,
> + target_residency, exit_latency, 0);
> } else if ((flags[i] & OPAL_PM_STOP_INST_FAST) &&
> !(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
> -   strncpy(powernv_states[nr_idle_states].name,
> -   names[i], CPUIDLE_NAME_LEN);
> -   strncpy(powernv_states[nr_idle_states].desc,
> -   names[i], CPUIDLE_NAME_LEN);
> -   powernv

Re: [Patch V6 2/6] irqchip: xilinx: Clean up irqdomain argument and read/write

2016-11-01 Thread Zubair Lutfullah Kakakhel

Hi,

Thanks for the review.

On 10/31/2016 07:51 PM, Thomas Gleixner wrote:

On Mon, 31 Oct 2016, Zubair Lutfullah Kakakhel wrote:

The drivers read/write function handling is a bit quirky.


Can you please explain in more detail what's quirky and why it should be
done differently?


And the irqmask is passed directly to the handler.


I can't make any sense out of that sentence.  Which handler? If you talk
about the write function, then I don't see a change. So what are you
talking about?


Thanks. I'll add more detail in v7 if this patch survives.




Add a new irqchip struct to pass to the handler and
cleanup read/write handling.


I still don't see what it cleans up. You move the write function pointer
into a data structure, which is exposed by another pointer. So you create
two levels of indirection in the hotpath. The function prototype is still
the same. So all this does is making things slower unless I'm missing
something.


I wrote this patch/cleanup based on a review of the driver by Marc when I moved
the driver from arch/microblaze to drivers/irqchip:

"Marc Zyngier

...

>  arch/microblaze/kernel/intc.c   | 196 

>  drivers/irqchip/irq-axi-intc.c  | 196 


...

> +  /* Yeah, okay, casting the intr_mask to a void* is butt-ugly, but I'm
> +   * lazy and Michal can clean it up to something nicer when he tests
> +   * and commits this patch.  ~~gcl */
> +  root_domain = irq_domain_add_linear(intc, nr_irq, &xintc_irq_domain_ops,
> +  (void *)intr_mask);

Since you're now reworking this driver, how about addressing this
ugliness? You could store the intr_mask together with intc_baseaddr,
and the read/write functions in a global structure, and pass a
pointer to it? That would make the code a bit nicer...
"

https://patchwork.kernel.org/patch/9287933/




-static unsigned int (*read_fn)(void __iomem *);
-static void (*write_fn)(u32, void __iomem *);
+struct xintc_irq_chip {
+   void __iomem *base;
+   struct  irq_domain *domain;
+   struct  irq_chip chip;


The tabs between struct and the structure name are bogus.


+   u32 intr_mask;
+   unsigned int (*read)(void __iomem *iomem);
+   void (*write)(u32 data, void __iomem *iomem);


Please structure that like a table:

    void __iomem        *base;
    struct irq_domain   *domain;
    struct irq_chip     chip;
    u32                 intr_mask;
    unsigned int        (*read)(void __iomem *iomem);
    void                (*write)(u32 data, void __iomem *iomem);

Can you see how that makes parsing the struct simpler, because the data
types are easy to identify?


That does make it look much better.




+static struct xintc_irq_chip *xintc_irqc;

 static void intc_write32(u32 val, void __iomem *addr)
 {
@@ -54,6 +60,18 @@ static unsigned int intc_read32_be(void __iomem *addr)
return ioread32be(addr);
 }

+static inline unsigned int xintc_read(struct xintc_irq_chip *xintc_irqc,
+int reg)
+{
+   return xintc_irqc->read(xintc_irqc->base + reg);
+}
+
+static inline void xintc_write(struct xintc_irq_chip *xintc_irqc,
+int reg, u32 data)
+{
+   xintc_irqc->write(data, xintc_irqc->base + reg);
+}
+
 static void intc_enable_or_unmask(struct irq_data *d)
 {
unsigned long mask = 1 << d->hwirq;
@@ -65,21 +83,21 @@ static void intc_enable_or_unmask(struct irq_data *d)
 * acks the irq before calling the interrupt handler
 */
if (irqd_is_level_type(d))
-   write_fn(mask, intc_baseaddr + IAR);
+   xintc_write(xintc_irqc, IAR, mask);


So this whole thing only makes sense when you want to support multiple
instances of that chip, and then you need to store the xintc_irqc pointer as
irqchip data and retrieve it from there. Unless you do that, this "cleanup"
is just churn for nothing with the effect of making things less efficient.
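
For reference, that pattern would look roughly like this (hypothetical sketch,
not from the patch):

	/* at map time the per-instance data (the domain's host_data) is
	 * attached to each virq as chip data */
	static int xintc_map(struct irq_domain *d, unsigned int irq,
			     irq_hw_number_t hw)
	{
		struct xintc_irq_chip *xintc = d->host_data;

		irq_set_chip_data(irq, xintc);
		irq_set_chip_and_handler(irq, &xintc->chip, handle_level_irq);
		return 0;
	}

	/* handlers then retrieve it instead of using a global pointer */
	static void intc_enable_or_unmask(struct irq_data *d)
	{
		struct xintc_irq_chip *xintc = irq_data_get_irq_chip_data(d);
		unsigned long mask = 1 << d->hwirq;

		if (irqd_is_level_type(d))
			xintc_write(xintc, IAR, mask);
		xintc_write(xintc, SIE, mask);
	}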



Indeed the driver doesn't support multiple instances of the Xilinx Interrupt 
controller.
I don't have a use-case or the hardware for that.

So what would be the recommended course of action?

Regards,
ZubairLK


Thanks,

tglx



[TRIVIAL][PATCH] Fixup OOPS output

2016-11-01 Thread Balbir Singh

With the recent updates to printk, we get console output like the following:

 P   O4.9.0-rc2-next-20161028-3-g23bf57759-dirty #125
 c000ef5f4000
 0001
 P   O (4.9.0-rc2-next-20161028-3-g23bf57759-dirty)
 900010009033
 <
 SF
 ,HV
 ,EE
 ,ME
 ,IR
 ,DR
 ,RI
 ,LE
 >

and more output that is not easy to parse.

Fix the OOPS output so that it is easier on the eyes again.
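
For reference, pr_cont() is shorthand for printk(KERN_CONT ...); after the
printk rework a bare printk() without KERN_CONT starts a new record, so
continuation fragments must use pr_cont(). An illustrative sketch only, not
code from the patch:

	printk(KERN_INFO "MSR: %016lx ", regs->msr);
	pr_cont("<");		/* appended to the same line */
	pr_cont("SF,HV");
	pr_cont(">\n");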

Signed-off-by: Balbir Singh 
---
 arch/powerpc/kernel/process.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index ce6dc61..a3193c7 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1282,7 +1282,7 @@ static void print_bits(unsigned long val, struct regbit 
*bits, const char *sep)
 
for (; bits->bit; ++bits)
if (val & bits->bit) {
-   printk("%s%s", s, bits->name);
+   pr_cont("%s%s", s, bits->name);
s = sep;
}
 }
@@ -1305,9 +1305,9 @@ static void print_tm_bits(unsigned long val)
  *   T: Transactional  (bit 34)
  */
if (val & (MSR_TM | MSR_TS_S | MSR_TS_T)) {
-   printk(",TM[");
+   pr_cont(",TM[");
print_bits(val, msr_tm_bits, "");
-   printk("]");
+   pr_cont("]");
}
 }
 #else
@@ -1316,10 +1316,10 @@ static void print_tm_bits(unsigned long val) {}
 
 static void print_msr_bits(unsigned long val)
 {
-   printk("<");
+   pr_cont("<");
print_bits(val, msr_bits, ",");
print_tm_bits(val);
-   printk(">");
+   pr_cont(">");
 }
 
 #ifdef CONFIG_PPC64
@@ -1347,15 +1347,15 @@ void show_regs(struct pt_regs * regs)
printk("  CR: %08lx  XER: %08lx\n", regs->ccr, regs->xer);
trap = TRAP(regs);
if ((regs->trap != 0xc00) && cpu_has_feature(CPU_FTR_CFAR))
-   printk("CFAR: "REG" ", regs->orig_gpr3);
+   pr_cont("CFAR: "REG" ", regs->orig_gpr3);
if (trap == 0x200 || trap == 0x300 || trap == 0x600)
 #if defined(CONFIG_4xx) || defined(CONFIG_BOOKE)
-   printk("DEAR: "REG" ESR: "REG" ", regs->dar, regs->dsisr);
+   pr_cont("DEAR: "REG" ESR: "REG" ", regs->dar, regs->dsisr);
 #else
-   printk("DAR: "REG" DSISR: %08lx ", regs->dar, regs->dsisr);
+   pr_cont("DAR: "REG" DSISR: %08lx ", regs->dar, regs->dsisr);
 #endif
 #ifdef CONFIG_PPC64
-   printk("SOFTE: %ld ", regs->softe);
+   pr_cont("SOFTE: %ld ", regs->softe);
 #endif
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
if (MSR_TM_ACTIVE(regs->msr))
@@ -1364,8 +1364,8 @@ void show_regs(struct pt_regs * regs)
 
for (i = 0;  i < 32;  i++) {
if ((i % REGS_PER_LINE) == 0)
-   printk("\nGPR%02d: ", i);
-   printk(REG " ", regs->gpr[i]);
+   pr_cont("\nGPR%02d: ", i);
+   pr_cont(REG " ", regs->gpr[i]);
if (i == LAST_VOLATILE && !FULL_REGS(regs))
break;
}
@@ -1906,7 +1906,7 @@ void show_stack(struct task_struct *tsk, unsigned long 
*stack)
}
 #endif
if (firstframe)
-   printk(" (unreliable)");
+   pr_cont(" (unreliable)");
printk("\n");
}
firstframe = 0;
-- 
2.5.5



[PATCH 0/3] Enable IAMR storage keys for radix

2016-11-01 Thread Balbir Singh
This series follows up on
https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-August/147840.html

The first patch sets up AMOR in hypervisor mode. AMOR
needs to be set up before IAMR (details of AMOR/IAMR in
each patch). The second patch enables detection of exceptions
generated due to instruction fetch violations and OOPSes
the offending task. The third patch enables IAMR for
both hypervisor and guest kernels.

I've tested this patch series with a sample hack and
payload.

Chris Smart helped with the series, reviewing and
providing valuable feedback.

Cc: Chris Smart 
Cc: Benjamin Herrenschmidt 
Cc: Michael Neuling 
Cc: Aneesh Kumar K.V 
Cc: Paul Mackerras 

Balbir Singh (3):
  Setup AMOR in HV mode
  Detect instruction fetch denied and report
  Enable storage keys for radix - user mode execution

 arch/powerpc/mm/fault.c |  4 
 arch/powerpc/mm/pgtable-radix.c | 39 +++
 2 files changed, 43 insertions(+)

-- 
2.5.5



[PATCH 1/3] Setup AMOR in HV mode

2016-11-01 Thread Balbir Singh
AMOR should be set up in HV mode; we set it up once
and let the generic kernel handle IAMR. This patch is a
prerequisite for enabling storage keys, as defined in ISA 3,
in a following patch.

Reported-by: Aneesh Kumar K.V 
Signed-off-by: Balbir Singh 
---
 arch/powerpc/mm/pgtable-radix.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index ed7bddc..0fdd8ed 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -320,6 +320,25 @@ static void update_hid_for_radix(void)
cpu_relax();
 }
 
+/*
+ * In HV mode, we init AMOR so that the hypervisor
+ * and guest can set up IAMR, enable key 0 and set
+ * it to 1.
+ * AMOR = 110000 (Mask for key 0 is 11)
+ */
+static void __init radix_init_amor(void)
+{
+   unsigned long amor_mask = 0xc000;
+   unsigned long amor = mfspr(SPRN_AMOR);
+
+   if (cpu_has_feature(CPU_FTR_POWER9_DD1))
+   return;
+
+   amor |= amor_mask;
+
+   mtspr(SPRN_AMOR, amor);
+}
+
 void __init radix__early_init_mmu(void)
 {
unsigned long lpcr;
@@ -376,6 +395,7 @@ void __init radix__early_init_mmu(void)
lpcr = mfspr(SPRN_LPCR);
mtspr(SPRN_LPCR, lpcr | LPCR_UPRT | LPCR_HR);
radix_init_partition_table();
+   radix_init_amor();
}
 
radix_init_pgtable();
-- 
2.5.5



[PATCH 2/3] Detect instruction fetch denied and report

2016-11-01 Thread Balbir Singh
ISA 3 allows for prevention of instruction fetch and execution
of user mode pages. If such an error occurs, SRR1 bit 35
reports the error. We catch and report the error in do_page_fault()

Signed-off-by: Balbir Singh 
---
 arch/powerpc/mm/fault.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index d0b137d..835fd03 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -404,6 +404,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
address,
(cpu_has_feature(CPU_FTR_NOEXECUTE) ||
 !(vma->vm_flags & (VM_READ | VM_WRITE
goto bad_area;
+
+   if (radix_enabled() && (regs->msr & SRR1_ISI_N_OR_G))
+   goto bad_area;
+
 #ifdef CONFIG_PPC_STD_MMU
/*
 * protfault should only happen due to us
-- 
2.5.5



[PATCH 3/3] Enable storage keys for radix - user mode execution

2016-11-01 Thread Balbir Singh
ISA 3 defines a new encoded access authority that allows instruction
access prevention in privileged mode while allowing normal access
in problem state. This patch just enables the IAMR (Instruction Authority
Mask Register); enabling the AMR would require more work.

I've tested this with a buggy driver and a simple payload. The payload
is specific to the build I've tested.

Signed-off-by: Balbir Singh 
---
 arch/powerpc/mm/pgtable-radix.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 0fdd8ed..cd3d400 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -339,6 +339,24 @@ static void __init radix_init_amor(void)
mtspr(SPRN_AMOR, amor);
 }
 
+/*
+ * For the radix page tables we set up, the IAMR value is as follows:
+ * IAMR = 0100...00 (key 0 is set to 1)
+ * AMR, UAMR, UAMOR are not affected
+ */
+static void __init radix_init_iamr(void)
+{
+   unsigned long iamr_mask = 0x4000;
+   unsigned long iamr = mfspr(SPRN_IAMR);
+
+   if (cpu_has_feature(CPU_FTR_POWER9_DD1))
+   return;
+
+   iamr |= iamr_mask;
+
+   mtspr(SPRN_IAMR, iamr);
+}
+
 void __init radix__early_init_mmu(void)
 {
unsigned long lpcr;
@@ -398,6 +416,7 @@ void __init radix__early_init_mmu(void)
radix_init_amor();
}
 
+   radix_init_iamr();
radix_init_pgtable();
 }
 
-- 
2.5.5



[PATCH 1/4] ibmebus: fix device reference leaks in sysfs interface

2016-11-01 Thread Johan Hovold
Make sure to drop any reference taken by bus_find_device() in the sysfs
callbacks that are used to create and destroy devices based on
device-tree entries.

Fixes: 6bccf755ff53 ("[POWERPC] ibmebus: dynamic addition/removal...)
Signed-off-by: Johan Hovold 
---
 arch/powerpc/kernel/ibmebus.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/ibmebus.c b/arch/powerpc/kernel/ibmebus.c
index 6ca9a2ffaac7..c7d3ff7e101c 100644
--- a/arch/powerpc/kernel/ibmebus.c
+++ b/arch/powerpc/kernel/ibmebus.c
@@ -262,6 +262,7 @@ static ssize_t ibmebus_store_probe(struct bus_type *bus,
   const char *buf, size_t count)
 {
struct device_node *dn = NULL;
+   struct device *dev;
char *path;
ssize_t rc = 0;
 
@@ -269,8 +270,10 @@ static ssize_t ibmebus_store_probe(struct bus_type *bus,
if (!path)
return -ENOMEM;
 
-   if (bus_find_device(&ibmebus_bus_type, NULL, path,
-   ibmebus_match_path)) {
+   dev = bus_find_device(&ibmebus_bus_type, NULL, path,
+ ibmebus_match_path);
+   if (dev) {
+   put_device(dev);
printk(KERN_WARNING "%s: %s has already been probed\n",
   __func__, path);
rc = -EEXIST;
@@ -307,6 +310,7 @@ static ssize_t ibmebus_store_remove(struct bus_type *bus,
if ((dev = bus_find_device(&ibmebus_bus_type, NULL, path,
   ibmebus_match_path))) {
of_device_unregister(to_platform_device(dev));
+   put_device(dev);
 
kfree(path);
return count;
-- 
2.7.3



[PATCH 2/4] ibmebus: fix further device reference leaks

2016-11-01 Thread Johan Hovold
Make sure to drop any reference taken by bus_find_device() when creating
devices during init and driver registration.

Fixes: 55347cc9962f ("[POWERPC] ibmebus: Add device creation...)
Signed-off-by: Johan Hovold 
---
 arch/powerpc/kernel/ibmebus.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/ibmebus.c b/arch/powerpc/kernel/ibmebus.c
index c7d3ff7e101c..35f5244782d9 100644
--- a/arch/powerpc/kernel/ibmebus.c
+++ b/arch/powerpc/kernel/ibmebus.c
@@ -180,6 +180,7 @@ static int ibmebus_create_device(struct device_node *dn)
 static int ibmebus_create_devices(const struct of_device_id *matches)
 {
struct device_node *root, *child;
+   struct device *dev;
int ret = 0;
 
root = of_find_node_by_path("/");
@@ -188,9 +189,12 @@ static int ibmebus_create_devices(const struct 
of_device_id *matches)
if (!of_match_node(matches, child))
continue;
 
-   if (bus_find_device(&ibmebus_bus_type, NULL, child,
-   ibmebus_match_node))
+   dev = bus_find_device(&ibmebus_bus_type, NULL, child,
+ ibmebus_match_node);
+   if (dev) {
+   put_device(dev);
continue;
+   }
 
ret = ibmebus_create_device(child);
if (ret) {
-- 
2.7.3



[PATCH 4/4] powerpc/pci: fix device reference leaks

2016-11-01 Thread Johan Hovold
Make sure to drop any device reference taken by vio_find_node() when
adding and removing virtual I/O slots.

Fixes: 5eeb8c63a38f ("[PATCH] PCI Hotplug: rpaphp: Move VIO...")
Signed-off-by: Johan Hovold 
---
 drivers/pci/hotplug/rpadlpar_core.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/rpadlpar_core.c 
b/drivers/pci/hotplug/rpadlpar_core.c
index dc67f39779ec..c614ff7c3bc3 100644
--- a/drivers/pci/hotplug/rpadlpar_core.c
+++ b/drivers/pci/hotplug/rpadlpar_core.c
@@ -257,8 +257,13 @@ static int dlpar_add_phb(char *drc_name, struct 
device_node *dn)
 
 static int dlpar_add_vio_slot(char *drc_name, struct device_node *dn)
 {
-   if (vio_find_node(dn))
+   struct vio_dev *vio_dev;
+
+   vio_dev = vio_find_node(dn);
+   if (vio_dev) {
+   put_device(&vio_dev->dev);
return -EINVAL;
+   }
 
if (!vio_register_device_node(dn)) {
printk(KERN_ERR
@@ -334,6 +339,9 @@ static int dlpar_remove_vio_slot(char *drc_name, struct 
device_node *dn)
return -EINVAL;
 
vio_unregister_device(vio_dev);
+
+   put_device(&vio_dev->dev);
+
return 0;
 }
 
-- 
2.7.3



[PATCH 3/4] powerpc/vio: clarify vio_find_node reference counting

2016-11-01 Thread Johan Hovold
Add a comment clarifying that vio_find_node() takes a reference to the
embedded struct device which needs to be dropped after use.

Signed-off-by: Johan Hovold 
---
 arch/powerpc/kernel/vio.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index b3813ddb2fb4..2c8fb3ec989e 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1648,6 +1648,9 @@ static struct vio_dev *vio_find_name(const char *name)
 /**
  * vio_find_node - find an already-registered vio_dev
  * @vnode: device_node of the virtual device we're looking for
+ *
+ * Takes a reference to the embedded struct device which needs to be dropped
+ * after use.
  */
 struct vio_dev *vio_find_node(struct device_node *vnode)
 {
-- 
2.7.3



[RFC v2 6/7] mm/powerpc: Use generic VDSO remap and unmap functions

2016-11-01 Thread Christopher Covington
The PowerPC VDSO remap and unmap code was copied to a generic location,
only modifying the variable name expected in mm->context (vdso instead of
vdso_base) to match most other architectures. Having adopted this generic
naming, drop the code in arch/powerpc and use the generic version.

Signed-off-by: Christopher Covington 
---
 arch/powerpc/Kconfig |  1 +
 arch/powerpc/include/asm/Kbuild  |  1 +
 arch/powerpc/include/asm/mm-arch-hooks.h | 28 -
 arch/powerpc/include/asm/mmu_context.h   | 35 +---
 4 files changed, 3 insertions(+), 62 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/mm-arch-hooks.h

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65fba4c..f4a1cb9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -134,6 +134,7 @@ config PPC
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
+   select GENERIC_VDSO
select HAVE_MOD_ARCH_SPECIFIC
select MODULES_USE_ELF_RELA
select CLONE_BACKWARDS
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 5c4fbc8..4d89ec6 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -8,3 +8,4 @@ generic-y += mcs_spinlock.h
 generic-y += preempt.h
 generic-y += rwsem.h
 generic-y += vtime.h
+generic-y += mm-arch-hooks.h
diff --git a/arch/powerpc/include/asm/mm-arch-hooks.h 
b/arch/powerpc/include/asm/mm-arch-hooks.h
deleted file mode 100644
index ea6da89..000
--- a/arch/powerpc/include/asm/mm-arch-hooks.h
+++ /dev/null
@@ -1,28 +0,0 @@
-/*
- * Architecture specific mm hooks
- *
- * Copyright (C) 2015, IBM Corporation
- * Author: Laurent Dufour 
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- */
-
-#ifndef _ASM_POWERPC_MM_ARCH_HOOKS_H
-#define _ASM_POWERPC_MM_ARCH_HOOKS_H
-
-static inline void arch_remap(struct mm_struct *mm,
- unsigned long old_start, unsigned long old_end,
- unsigned long new_start, unsigned long new_end)
-{
-   /*
-* mremap() doesn't allow moving multiple vmas so we can limit the
-* check to old_start == vdso.
-*/
-   if (old_start == mm->context.vdso)
-   mm->context.vdso = new_start;
-}
-#define arch_remap arch_remap
-
-#endif /* _ASM_POWERPC_MM_ARCH_HOOKS_H */
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index c907478..d8dcf45 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -8,6 +8,7 @@
 #include 
 #include
 #include 
+#include 
 #include 
 
 /*
@@ -133,39 +134,5 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
 #endif
 }
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-struct mm_struct *mm)
-{
-}
-
-static inline void arch_exit_mmap(struct mm_struct *mm)
-{
-}
-
-static inline void arch_unmap(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
-{
-   if (start <= mm->context.vdso && mm->context.vdso < end)
-   mm->context.vdso = 0;
-}
-
-static inline void arch_bprm_mm_init(struct mm_struct *mm,
-struct vm_area_struct *vma)
-{
-}
-
-static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-   bool write, bool execute, bool foreign)
-{
-   /* by default, allow everything */
-   return true;
-}
-
-static inline bool arch_pte_access_permitted(pte_t pte, bool write)
-{
-   /* by default, allow everything */
-   return true;
-}
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.



[RFC v2 5/7] powerpc: Rename context.vdso_base to context.vdso

2016-11-01 Thread Christopher Covington
Checkpoint/Restore In Userspace (CRIU) needs to be able to unmap and remap
the VDSO to successfully checkpoint and restore applications in the face of
changing VDSO addresses due to Address Space Layout Randomization (ASLR,
randmaps). x86 and PowerPC have had architecture-specific code to support
this. In order to expand the architectures that support this without
unnecessary duplication of code, a generic version based on the PowerPC code
was created. It differs slightly, based on the results of an informal
survey of all architectures that indicated

unsigned long vdso;

is popular (and it's also concise). Therefore, change the variable name in
powerpc from mm->context.vdso_base to mm->context.vdso.

Signed-off-by: Christopher Covington 
---
 arch/powerpc/include/asm/book3s/32/mmu-hash.h |  2 +-
 arch/powerpc/include/asm/book3s/64/mmu.h  |  2 +-
 arch/powerpc/include/asm/mm-arch-hooks.h  |  6 +++---
 arch/powerpc/include/asm/mmu-40x.h|  2 +-
 arch/powerpc/include/asm/mmu-44x.h|  2 +-
 arch/powerpc/include/asm/mmu-8xx.h|  2 +-
 arch/powerpc/include/asm/mmu-book3e.h |  2 +-
 arch/powerpc/include/asm/mmu_context.h|  4 ++--
 arch/powerpc/include/asm/vdso.h   |  2 +-
 arch/powerpc/include/uapi/asm/elf.h   |  2 +-
 arch/powerpc/kernel/signal_32.c   |  8 
 arch/powerpc/kernel/signal_64.c   |  4 ++--
 arch/powerpc/kernel/vdso.c|  8 
 arch/powerpc/perf/callchain.c | 12 ++--
 14 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/32/mmu-hash.h
index b82e063..75738bb 100644
--- a/arch/powerpc/include/asm/book3s/32/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/32/mmu-hash.h
@@ -79,7 +79,7 @@ struct hash_pte {
 
 typedef struct {
unsigned long id;
-   unsigned long vdso_base;
+   unsigned long vdso;
 } mm_context_t;
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 8afb0e0..8486a10 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -72,7 +72,7 @@ typedef struct {
 #else
u16 sllp;   /* SLB page size encoding */
 #endif
-   unsigned long vdso_base;
+   unsigned long vdso;
 #ifdef CONFIG_PPC_SUBPAGE_PROT
struct subpage_prot_table spt;
 #endif /* CONFIG_PPC_SUBPAGE_PROT */
diff --git a/arch/powerpc/include/asm/mm-arch-hooks.h 
b/arch/powerpc/include/asm/mm-arch-hooks.h
index f2a2da8..ea6da89 100644
--- a/arch/powerpc/include/asm/mm-arch-hooks.h
+++ b/arch/powerpc/include/asm/mm-arch-hooks.h
@@ -18,10 +18,10 @@ static inline void arch_remap(struct mm_struct *mm,
 {
/*
 * mremap() doesn't allow moving multiple vmas so we can limit the
-* check to old_start == vdso_base.
+* check to old_start == vdso.
 */
-   if (old_start == mm->context.vdso_base)
-   mm->context.vdso_base = new_start;
+   if (old_start == mm->context.vdso)
+   mm->context.vdso = new_start;
 }
 #define arch_remap arch_remap
 
diff --git a/arch/powerpc/include/asm/mmu-40x.h 
b/arch/powerpc/include/asm/mmu-40x.h
index 3491686..e8e57b7 100644
--- a/arch/powerpc/include/asm/mmu-40x.h
+++ b/arch/powerpc/include/asm/mmu-40x.h
@@ -56,7 +56,7 @@
 typedef struct {
unsigned intid;
unsigned intactive;
-   unsigned long   vdso_base;
+   unsigned long   vdso;
 } mm_context_t;
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/mmu-44x.h 
b/arch/powerpc/include/asm/mmu-44x.h
index bf52d70..471891c 100644
--- a/arch/powerpc/include/asm/mmu-44x.h
+++ b/arch/powerpc/include/asm/mmu-44x.h
@@ -107,7 +107,7 @@ extern unsigned int tlb_44x_index;
 typedef struct {
unsigned intid;
unsigned intactive;
-   unsigned long   vdso_base;
+   unsigned long   vdso;
 } mm_context_t;
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/mmu-8xx.h 
b/arch/powerpc/include/asm/mmu-8xx.h
index 3e0e492..2834af0 100644
--- a/arch/powerpc/include/asm/mmu-8xx.h
+++ b/arch/powerpc/include/asm/mmu-8xx.h
@@ -167,7 +167,7 @@
 typedef struct {
unsigned int id;
unsigned int active;
-   unsigned long vdso_base;
+   unsigned long vdso;
 } mm_context_t;
 
 #define PHYS_IMMR_BASE (mfspr(SPRN_IMMR) & 0xfff8)
diff --git a/arch/powerpc/include/asm/mmu-book3e.h 
b/arch/powerpc/include/asm/mmu-book3e.h
index b62a8d4..28dc4e0 100644
--- a/arch/powerpc/include/asm/mmu-book3e.h
+++ b/arch/powerpc/include/asm/mmu-book3e.h
@@ -228,7 +228,7 @@ extern unsigned int tlbcam_index;
 typedef struct {
unsigned intid;
unsigned intactive;
-   unsigned long   vdso_base;
+   unsigned long   vdso;
 #ifdef CONFIG_PPC_MM_SLICES
u64 low_slices_psize;   /* S

Re: [PATCH v2 2/3] cpuidle:powernv: Add helper function to populate powernv idle states.

2016-11-01 Thread Paul Mackerras
On Tue, Nov 01, 2016 at 07:32:58PM +1100, Oliver O'Halloran wrote:
> On Thu, Oct 27, 2016 at 7:35 PM, Gautham R. Shenoy
>  wrote:
> > From: "Gautham R. Shenoy" 
> >
> > In the current code for powernv_add_idle_states, there is a lot of code
> > duplication while initializing an idle state in powernv_states table.
> >
> > Add an inline helper function to populate the powernv_states[] table for
> > a given idle state. Invoke this for populating the "Nap", "Fastsleep"
> > and the stop states in powernv_add_idle_states.
> >
> > Signed-off-by: Gautham R. Shenoy 
> > ---
> >  drivers/cpuidle/cpuidle-powernv.c | 82 
> > +++
> >  include/linux/cpuidle.h   |  1 +
> >  2 files changed, 49 insertions(+), 34 deletions(-)
> >
> > diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> > b/drivers/cpuidle/cpuidle-powernv.c
> > index 7fe442c..11b22b9 100644
> > --- a/drivers/cpuidle/cpuidle-powernv.c
> > +++ b/drivers/cpuidle/cpuidle-powernv.c
> > @@ -167,6 +167,28 @@ static int powernv_cpuidle_driver_init(void)
> > return 0;
> >  }
> >
> > +static inline void add_powernv_state(int index, const char *name,
> > +unsigned int flags,
> > +int (*idle_fn)(struct cpuidle_device *,
> > +   struct cpuidle_driver *,
> > +   int),
> > +unsigned int target_residency,
> > +unsigned int exit_latency,
> > +u64 psscr_val)
> > +{
> > +   strncpy(powernv_states[index].name, name, CPUIDLE_NAME_LEN);
> > +   strncpy(powernv_states[index].desc, name, CPUIDLE_NAME_LEN);
> 
> If the length of the supplied name is equal to CPUIDLE_NAME_LEN then strncpy()
> won't terminate the string. The least annoying fix is to memset() the whole
> structure to zero and pass CPUIDLE_NAME_LEN - 1 as the size argument.

Or he could use strlcpy() instead of strncpy().
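
E.g. (untested sketch):

	strlcpy(powernv_states[index].name, name, CPUIDLE_NAME_LEN);
	strlcpy(powernv_states[index].desc, name, CPUIDLE_NAME_LEN);

strlcpy() truncates to at most CPUIDLE_NAME_LEN - 1 bytes and always
NUL-terminates the destination.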

Paul.


Re: [PATCH] cxl: Fix error handling

2016-11-01 Thread Jim Davis
On Sun, Oct 30, 2016 at 10:37 PM, Michael Ellerman  wrote:
> Christophe JAILLET  writes:
>
>> 'cxl_dev_context_init()' returns an error pointer in case of error, not
>> NULL. So test it with IS_ERR.
>>
>> Signed-off-by: Christophe JAILLET 
>> ---
>> un-compiled because I don't have the required  cross build environment.
>
> Do you run Ubuntu or Fedora? If so it's just a dnf/apt-get away:
>
> $ sudo dnf install gcc-c++-powerpc64-linux-gnu binutils-powerpc64-linux-gnu 
> gcc-powerpc64-linux-gnu
> or
> $ sudo apt-get install gcc-powerpc64le-linux-gnu gcc-powerpc-linux-gnu 
> libc-dev-powerpc-cross libc-dev-ppc64el-cross
>
> More here:
>
> https://github.com/linuxppc/linux/wiki/Building-powerpc-kernels

Cool; the little-endian build worked fine, but

jim@krebstar:~/linux-rc$ make ARCH=powerpc
CROSS_COMPILE=powerpc64-linux-gnu- vmlinux
make: powerpc64-linux-gnu-gcc: Command not found
make: powerpc64-linux-gnu-gcc: Command not found
scripts/kconfig/conf  --silentoldconfig Kconfig
make: powerpc64-linux-gnu-gcc: Command not found

This is on Ubuntu 16.04; there's a /usr/bin/powerpc64le-linux-gnu-gcc
from installing gcc-powerpc64le-linux-gnu, and a
/usr/bin/powerpc-linux-gnu-gcc from installing gcc-powerpc-linux-gnu,
but no /usr/bin/powerpc64-linux-gnu-gcc.

-- 
Jim


Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

2016-11-01 Thread Alexey Kardashevskiy
On 31/10/16 15:23, David Gibson wrote:
> On Mon, Oct 31, 2016 at 03:13:21PM +1100, Alexey Kardashevskiy wrote:
>> On 31/10/16 14:13, David Gibson wrote:
>>> On Tue, Oct 25, 2016 at 03:55:56PM +1100, Alexey Kardashevskiy wrote:
 On 25/10/16 15:44, David Gibson wrote:
> On Mon, Oct 24, 2016 at 05:53:10PM +1100, Alexey Kardashevskiy wrote:
>> At the moment the userspace tool is expected to request pinning of
>> the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
>> When the userspace process finishes, all the pinned pages need to
>> be put; this is done as a part of the userspace memory context (MM)
>> destruction which happens on the very last mmdrop().
>>
>> This approach has a problem: an MM of the userspace process
>> may live longer than the userspace process itself, as kernel threads
>> use userspace process MMs which were running on a CPU where
>> the kernel thread was scheduled to. If this happens, the MM remains
>> referenced until this exact kernel thread wakes up again
>> and releases the very last reference to the MM; on an idle system this
>> can take hours.
>>
>> This moves preregistered regions tracking from MM to VFIO; instead of
>> using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
>> added so each container releases regions which it has pre-registered.
>>
>> This changes the userspace interface to return EBUSY if a memory
>> region is already registered in a container. However, it should not
>> have any practical effect, as the only userspace tool available now
>> registers a memory region once per container anyway.
>>
>> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
>> under container->lock, this does not need additional locking.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> Reviewed-by: Nicholas Piggin 
>
> On the grounds that this leaves things in a better state than before:
>
> Reviewed-by: David Gibson 
>
> On the other hand the implementation is kind of clunky, with the way
> it keeps the mm-level and vfio-level lists of regions in parallel.
> With this change, does the mm-level list actually serve any purpose at
> all, or could it all be moved into the vfio-level list?


 The mm-level list allows not having gup() called for each container (minor
 thing, I suppose) and it also tracks a number of active mappings which will
 become useful when we add in-kernel real-mode TCE acceleration as
 vfio-level code cannot run in realmode.
>>>
>>> Hm, ok.  So, if two different containers pre-register the same region
>>> of memory, IIUC in the proposed code, the region will get one entry in
>>> the mm level list, and that entry will be referenced in the lists for
>>> both containers.  Yes?
>>
>> Yes.
>>
>>
>>> What happens if two different containers try to pre-register
>>> different, but overlapping, mm regions?
>>
>> The second container will fail to preregister memory - mm_iommu_get() will
>> return -EINVAL.
> 
> Um.. yeah.. that's not really ok.  Prohibiting overlapping
> registrations on the same container is reasonable enough.  Having a
> container not be able to register memory because some completely
> different container has registered something overlapping is getting
> very ugly.

I am lost here. Does this mean the patches cannot go upstream?

Also, how would I implement overlapping if we are not teaching KVM about
VFIO containers? The mm list has a counter of how many times each memory
region was mapped via TCE (and this prevents unregistration), and if we
want overlapping regions, which one's "mapped" counter would I update
in real mode (where I only have a user address and a LIOBN)?



> 
>> I am wondering what happens to the series now.
>>
>> Alex, could you please have a look and comment? Thanks.
>>
>>
>>
>>>



>
>> ---
>> Changes:
>> v4:
>> * changed tce_iommu_register_pages() to call mm_iommu_find() first and
>> avoid calling mm_iommu_put() if memory is preregistered already
>>
>> v3:
>> * moved tce_iommu_prereg_free() call out of list_for_each_entry()
>>
>> v2:
>> * updated commit log
>> ---
>>  arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
>>  arch/powerpc/mm/mmu_context_iommu.c| 11 ---
>>  drivers/vfio/vfio_iommu_spapr_tce.c| 58 
>> +-
>>  3 files changed, 57 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
>> b/arch/powerpc/mm/mmu_context_book3s64.c
>> index ad82735..1a07969 100644
>> --- a/arch/powerpc/mm/mmu_context_book3s64.c
>> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
>> @@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct 
>> mm_struct *mm)
>>  
>>  void destroy_context(struct mm_struct *mm)
>>  {
>>>

Re: [kernel-hardening] [PATCH] powerpc/kernel: Disable the latent entropy plugin unconditionally

2016-11-01 Thread Andrew Donnellan

On 27/06/16 01:34, Emese Revfy wrote:


Reported-by: PaX Team 
Signed-off-by: Emese Revfy 
---
 arch/powerpc/kernel/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 01935b8..e9ef44f 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -14,11 +14,12 @@ CFLAGS_prom_init.o  += -fPIC
 CFLAGS_btext.o += -fPIC
 endif

-ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_cputable.o += $(DISABLE_LATENT_ENTROPY_PLUGIN)
 CFLAGS_init.o += $(DISABLE_LATENT_ENTROPY_PLUGIN)


I think you meant prom_init.o...

Additionally, DISABLE_LATENT_ENTROPY_PLUGIN is conditioned on 
CONFIG_PAX_LATENT_ENTROPY rather than CONFIG_GCC_PLUGIN_LATENT_ENTROPY, 
so it doesn't get exported correctly.


Will submit fixes along with patches to enable plugins on powerpc once I 
get that sorted.


(In future please remember to cc linuxppc-dev.)

--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



[PATCH v7 00/11] implement vcpu preempted check

2016-11-01 Thread Pan Xinhui
change from v6:
fix typos and remove unnecessary comments.
change from v5:
split x86/kvm patch into guest/host parts.
introduce kvm_write_guest_offset_cached.
fix some typos.
rebase patch onto 4.9.2
change from v4:
split x86 kvm vcpu preempted check into two patches.
add documentation patch.
add x86 vcpu preempted check patch under xen
add s390 vcpu preempted check patch 
change from v3:
add x86 vcpu preempted check patch
change from v2:
no code change, fix typos, update some comments
change from v1:
a simpler definition of default vcpu_is_preempted
skip machine type check on ppc, and add config; remove dedicated macro.
add one patch to drop overload of rwsem_spin_on_owner and 
mutex_spin_on_owner. 
add more comments
Thanks to Boqun's and Peter's suggestions.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some
spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused RCU stalls before we applied this patch set.

We also observed some performance improvements in unix benchmark tests.

PPC test result:
1 copy - 0.94%
2 copy - 7.17%
4 copy - 11.9%
8 copy -  3.04%
16 copy - 15.11%

details below:
Without patch:

1 copy - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 1 
samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 1 
samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 1 
samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 1 
samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 s, 
1 samples)

With patch: 

1 copy - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 1 
samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 1 
samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 1 
samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 1 
samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 s, 
1 samples)

X86 test result:
test-case   after-patch   before-patch
Execl Throughput   |18307.9 lps  |11701.6 lps 
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput| 11872208.7 lps  | 11855628.9 lps 
Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps 
Process Creation   |29881.2 lps  |28572.8 lps 
Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm 
Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm 
System Call Overhead   | 10385653.0 lps  | 10419979.0 lps 


Christian Borntraeger (1):
  s390/spinlock: Provide vcpu_is_preempted

Juergen Gross (1):
  x86, xen: support vcpu preempted check

Pan Xinhui (9):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, paravirt: Add interface to support kvm/xen vcpu preempted check
  KVM: Introduce kvm_write_guest_offset_cached
  x86, kvm/x86.c: support vcpu preempted check
  x86, kernel/kvm.c: support vcpu preempted check
  Documentation: virtual: kvm: Support vcpu preempted check

 Documentation/virtual/kvm/msr.txt |  9 -
 arch/powerpc/include/asm/spinlock.h   |  8 
 arch/s390/include/asm/spinlock.h  |  8 
 arch/s390/kernel/smp.c|  9 +++--
 arch/s390/lib/spinlock.c  | 25 -
 arch/x86/include/asm/paravirt_types.h |  2 ++
 arch/x86/include/asm/spinlock.h   |  8 
 arch/x86/include/uapi/asm/kvm_para.h  |  4 +++-
 arch/x86/kernel/kvm.c | 12 
 arch/x86/kernel/paravirt-spinlocks.c  |  6 ++
 arch/x86/kvm/x86.c| 16 
 arch/x86/xen/spinlock.c   |  3 ++-
 include/linux/kvm_host.h  |  2

[PATCH v7 01/11] kernel/sched: introduce vcpu preempted check interface

2016-11-01 Thread Pan Xinhui
This patch helps fix the lock holder preemption issue.

For kernel users, we can use bool vcpu_is_preempted(int cpu) to detect
whether a vcpu is preempted or not.

The default implementation is a macro defined as false, so the compiler
can optimize it out if the arch does not support such a vcpu preempted check.
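
A typical caller would then break out of an optimistic spin roughly like this
(an illustrative sketch only; lock_taken and owner_cpu are hypothetical, the
real users are added in the following patches):

	while (!READ_ONCE(lock_taken)) {
		/* stop spinning and block if the holder's vcpu is preempted */
		if (need_resched() || vcpu_is_preempted(owner_cpu))
			break;
		cpu_relax();
	}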

Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Pan Xinhui 
Acked-by: Christian Borntraeger 
Acked-by: Paolo Bonzini 
Tested-by: Juergen Gross 
---
 include/linux/sched.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, 
unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with a various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu) false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
-- 
2.4.11



[PATCH v7 02/11] locking/osq: Drop the overload of osq_lock()

2016-11-01 Thread Pan Xinhui
An over-committed guest with more vCPUs than pCPUs has a heavy overload
in osq_lock().

This is because vCPU A holds the osq lock and yields out, while vCPU B waits
for the per_cpu node->locked to be set. IOW, vCPU B waits for vCPU A to run
and unlock the osq lock.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to detect if a
vCPU is currently running or not, so break the spin loops when it returns
true.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng 
Signed-off-by: Pan Xinhui 
Acked-by: Christian Borntraeger 
Acked-by: Paolo Bonzini 
Tested-by: Juergen Gross 
---
 kernel/locking/osq_lock.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..091f97f 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
 }
 
+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+   return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
while (!READ_ONCE(node->locked)) {
/*
 * If we need to reschedule bail... so we can block.
+* Use vcpu_is_preempted to detect lock holder preemption issue.
 */
-   if (need_resched())
+   if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
goto unqueue;
 
cpu_relax_lowlatency();
-- 
2.4.11



[PATCH v7 03/11] kernel/locking: Drop the overload of {mutex, rwsem}_spin_on_owner

2016-11-01 Thread Pan Xinhui
An over-committed guest with more vCPUs than pCPUs has a heavy overload
in the two spin_on_owner functions. This is caused by the lock holder
preemption issue.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see if a vCPU
is currently running or not, so break the spin loops when it returns true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown] [H] 0xc00768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Signed-off-by: Pan Xinhui 
Acked-by: Christian Borntraeger 
Acked-by: Paolo Bonzini 
Tested-by: Juergen Gross 
---
 kernel/locking/mutex.c  | 13 +++--
 kernel/locking/rwsem-xadd.c | 14 +++---
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..24face6 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,11 @@ bool mutex_spin_on_owner(struct mutex *lock, struct 
task_struct *owner)
 */
barrier();
 
-   if (!owner->on_cpu || need_resched()) {
+   /*
+* Use vcpu_is_preempted to detect lock holder preemption issue.
+*/
+   if (!owner->on_cpu || need_resched() ||
+   vcpu_is_preempted(task_cpu(owner))) {
ret = false;
break;
}
@@ -261,8 +265,13 @@ static inline int mutex_can_spin_on_owner(struct mutex 
*lock)
 
rcu_read_lock();
owner = READ_ONCE(lock->owner);
+
+   /*
+* As lock holder preemption issue, we both skip spinning if task is not
+* on cpu or its cpu is preempted
+*/
if (owner)
-   retval = owner->on_cpu;
+   retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
rcu_read_unlock();
/*
 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..b664ce1 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct 
rw_semaphore *sem)
goto done;
}
 
-   ret = owner->on_cpu;
+   /*
+* As lock holder preemption issue, we both skip spinning if task is not
+* on cpu or its cpu is preempted
+*/
+   ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
rcu_read_unlock();
return ret;
@@ -362,8 +366,12 @@ static noinline bool rwsem_spin_on_owner(struct 
rw_semaphore *sem)
 */
barrier();
 
-   /* abort spinning when need_resched or owner is not running */
-   if (!owner->on_cpu || need_resched()) {
+   /*
+* abort spinning when need_resched or owner is not running or
+* owner's cpu is preempted.
+*/
+   if (!owner->on_cpu || need_resched() ||
+   vcpu_is_preempted(task_cpu(owner))) {
rcu_read_unlock();
return false;
}
-- 
2.4.11



[PATCH v7 04/11] powerpc/spinlock: support vcpu preempted check

2016-11-01 Thread Pan Xinhui
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself. The kernel
currently has an interface, bool vcpu_is_preempted(int cpu); it takes the cpu
as a parameter and returns true if that cpu is preempted. The kernel can then
break the spin loops based on the return value of vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

Only pSeries needs to support it, and PowerNV is built into the same kernel
image as pSeries, so we need to return false if we are running as PowerNV.
Another fact is that lppaca->yield_count stays zero on PowerNV, so we can
just skip the machine type check.

Suggested-by: Boqun Feng 
Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Pan Xinhui 
---
 arch/powerpc/include/asm/spinlock.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index fa37fe9..8c1b913 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif
 
+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+   return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
return lock.slock == 0;
-- 
2.4.11



[PATCH v7 05/11] s390/spinlock: Provide vcpu_is_preempted

2016-11-01 Thread Pan Xinhui
From: Christian Borntraeger 

This implements the s390 backend for the commit
"kernel/sched: introduce vcpu preempted check interface"
by reworking the existing smp_vcpu_scheduled into
arch_vcpu_is_preempted. We can then also get rid of the
local cpu_is_preempted function by moving the
CIF_ENABLED_WAIT test into arch_vcpu_is_preempted.

Signed-off-by: Christian Borntraeger 
Acked-by: Heiko Carstens 
---
 arch/s390/include/asm/spinlock.h |  8 
 arch/s390/kernel/smp.c   |  9 +++--
 arch/s390/lib/spinlock.c | 25 -
 3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h
index 7e9e09f..7ecd890 100644
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -23,6 +23,14 @@ _raw_compare_and_swap(unsigned int *lock, unsigned int old, 
unsigned int new)
return __sync_bool_compare_and_swap(lock, old, new);
 }
 
+#ifndef CONFIG_SMP
+static inline bool arch_vcpu_is_preempted(int cpu) { return false; }
+#else
+bool arch_vcpu_is_preempted(int cpu);
+#endif
+
+#define vcpu_is_preempted arch_vcpu_is_preempted
+
 /*
  * Simple spin lock operations.  There are two variants, one clears IRQ's
  * on the local processor, one does not.
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 35531fe..b988ed1 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -368,10 +368,15 @@ int smp_find_processor_id(u16 address)
return -1;
 }
 
-int smp_vcpu_scheduled(int cpu)
+bool arch_vcpu_is_preempted(int cpu)
 {
-   return pcpu_running(pcpu_devices + cpu);
+   if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
+   return false;
+   if (pcpu_running(pcpu_devices + cpu))
+   return false;
+   return true;
 }
+EXPORT_SYMBOL(arch_vcpu_is_preempted);
 
 void smp_yield_cpu(int cpu)
 {
diff --git a/arch/s390/lib/spinlock.c b/arch/s390/lib/spinlock.c
index e5f50a7..e48a48e 100644
--- a/arch/s390/lib/spinlock.c
+++ b/arch/s390/lib/spinlock.c
@@ -37,15 +37,6 @@ static inline void _raw_compare_and_delay(unsigned int 
*lock, unsigned int old)
asm(".insn rsy,0xeb22,%0,0,%1" : : "d" (old), "Q" (*lock));
 }
 
-static inline int cpu_is_preempted(int cpu)
-{
-   if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
-   return 0;
-   if (smp_vcpu_scheduled(cpu))
-   return 0;
-   return 1;
-}
-
 void arch_spin_lock_wait(arch_spinlock_t *lp)
 {
unsigned int cpu = SPINLOCK_LOCKVAL;
@@ -62,7 +53,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
continue;
}
/* First iteration: check if the lock owner is running. */
-   if (first_diag && cpu_is_preempted(~owner)) {
+   if (first_diag && arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
continue;
@@ -81,7 +72,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
 * yield the CPU unconditionally. For LPAR rely on the
 * sense running status.
 */
-   if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+   if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
}
@@ -108,7 +99,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned 
long flags)
continue;
}
/* Check if the lock owner is running. */
-   if (first_diag && cpu_is_preempted(~owner)) {
+   if (first_diag && arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
continue;
@@ -127,7 +118,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, 
unsigned long flags)
 * yield the CPU unconditionally. For LPAR rely on the
 * sense running status.
 */
-   if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+   if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
}
@@ -165,7 +156,7 @@ void _raw_read_lock_wait(arch_rwlock_t *rw)
owner = 0;
while (1) {
if (count-- <= 0) {
-   if (owner && cpu_is_preempted(~owner))
+   if (owner && arch_vcpu_is_preempted(~owner))
smp_yield_cpu(~owner);
count = spin_retry;
}
@@ -211,7 +202,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw, unsigned int 
prev)
owner = 0;
while (1) {
if (count-- <= 0) {
-   if (owner && cpu_is_preempted(~owner))
+   if

[PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check

2016-11-01 Thread Pan Xinhui
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself. The kernel
currently has an interface, bool vcpu_is_preempted(int cpu); it takes the cpu
as a parameter and returns true if that cpu is preempted. The kernel can then
break the spin loops based on the return value of vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

To deal with kernel and kvm/xen, add vcpu_is_preempted into struct
pv_lock_ops.

Then kvm or xen could provide their own implementation to support
vcpu_is_preempted.

Signed-off-by: Pan Xinhui 
Acked-by: Paolo Bonzini 
---
 arch/x86/include/asm/paravirt_types.h | 2 ++
 arch/x86/include/asm/spinlock.h   | 8 
 arch/x86/kernel/paravirt-spinlocks.c  | 6 ++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..38c3bb7 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -310,6 +310,8 @@ struct pv_lock_ops {
 
void (*wait)(u8 *ptr, u8 val);
void (*kick)(int cpu);
+
+   bool (*vcpu_is_preempted)(int cpu);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..0526f59 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+   return pv_lock_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
 #include 
 
 /*
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 2c55a00..2f204dd 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -21,12 +21,18 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
 }
 
+static bool native_vcpu_is_preempted(int cpu)
+{
+   return false;
+}
+
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
.wait = paravirt_nop,
.kick = paravirt_nop,
+   .vcpu_is_preempted = native_vcpu_is_preempted,
 #endif /* SMP */
 };
 EXPORT_SYMBOL(pv_lock_ops);
-- 
2.4.11
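
For readers new to the paravirt indirection used here, the following
standalone sketch shows the shape of what this patch wires up: a
function-pointer slot with a bare-metal default that never reports
preemption, which a hypervisor guest can override during early init.
All names below are illustrative, not the kernel's real symbols.

#include <stdbool.h>

/* One slot in a pv-ops style table, with a bare-metal default. */
struct pv_lock_ops_sketch {
	bool (*vcpu_is_preempted)(int cpu);
};

static bool native_is_preempted_sketch(int cpu)
{
	return false;	/* on bare metal a cpu is never "preempted" */
}

static struct pv_lock_ops_sketch lock_ops_sketch = {
	.vcpu_is_preempted = native_is_preempted_sketch,
};

/* What arch code would expose to generic spin loops. */
static inline bool vcpu_is_preempted_sketch(int cpu)
{
	return lock_ops_sketch.vcpu_is_preempted(cpu);
}

/* A guest would install its own check at init time, e.g.: */
static bool guest_is_preempted_sketch(int cpu)
{
	return false;	/* would read a hypervisor-shared per-cpu flag */
}

static void guest_init_sketch(void)
{
	lock_ops_sketch.vcpu_is_preempted = guest_is_preempted_sketch;
}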



[PATCH v7 07/11] KVM: Introduce kvm_write_guest_offset_cached

2016-11-01 Thread Pan Xinhui
This allows us to update part of a cached guest structure, for example a
single field, instead of rewriting the whole thing.

It also saves one kvm_read_guest_cached() when we only want to update a
single field of the struct regardless of its current value.

Signed-off-by: Pan Xinhui 
Acked-by: Paolo Bonzini 
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c  | 20 ++--
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01c0b9c..6f00237 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -645,6 +645,8 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void 
*data,
unsigned long len);
 int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
   void *data, unsigned long len);
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache 
*ghc,
+  void *data, int offset, unsigned long len);
 int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
  gpa_t gpa, unsigned long len);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2907b7b..95308ee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1972,30 +1972,38 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct 
gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);
 
-int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
-  void *data, unsigned long len)
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache 
*ghc,
+  void *data, int offset, unsigned long len)
 {
struct kvm_memslots *slots = kvm_memslots(kvm);
int r;
+   gpa_t gpa = ghc->gpa + offset;
 
-   BUG_ON(len > ghc->len);
+   BUG_ON(len + offset > ghc->len);
 
if (slots->generation != ghc->generation)
kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa, ghc->len);
 
if (unlikely(!ghc->memslot))
-   return kvm_write_guest(kvm, ghc->gpa, data, len);
+   return kvm_write_guest(kvm, gpa, data, len);
 
if (kvm_is_error_hva(ghc->hva))
return -EFAULT;
 
-   r = __copy_to_user((void __user *)ghc->hva, data, len);
+   r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
if (r)
return -EFAULT;
-   mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+   mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
 
return 0;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_offset_cached);
+
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+  void *data, unsigned long len)
+{
+   return kvm_write_guest_offset_cached(kvm, ghc, data, 0, len);
+}
 EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
 
 int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
-- 
2.4.11
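
As a hedged usage sketch (the struct and helper below are hypothetical;
only the kvm_write_guest_offset_cached() signature comes from this
patch): with the offset variant, a caller can push a single member of a
guest-mapped structure through an existing gfn_to_hva_cache without
reading or rewriting the rest of it.

#include <linux/kvm_host.h>	/* kvm_write_guest_offset_cached()   */
#include <linux/stddef.h>	/* offsetof()                        */

struct guest_shared {		/* hypothetical guest-visible layout */
	u64 generation;
	u32 flags;
	u32 pad;
};

/* Write only the 4-byte 'flags' member at its offset in guest memory. */
static int set_guest_flags(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
			   u32 flags)
{
	return kvm_write_guest_offset_cached(kvm, ghc, &flags,
					     offsetof(struct guest_shared, flags),
					     sizeof(flags));
}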



[PATCH v7 08/11] x86, kvm/x86.c: support vcpu preempted check

2016-11-01 Thread Pan Xinhui
Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

Use the new ::preempted field of struct kvm_steal_time to indicate
whether a vcpu is currently running.

Signed-off-by: Pan Xinhui 
Acked-by: Paolo Bonzini 
---
 arch/x86/include/uapi/asm/kvm_para.h |  4 +++-
 arch/x86/kvm/x86.c   | 16 
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..1421a65 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,9 @@ struct kvm_steal_time {
__u64 steal;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u8  preempted;
+   __u8  u8_pad[3];
+   __u32 pad[11];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e375235..f06e115 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
return;
 
+   vcpu->arch.st.steal.preempted = 0;
+
if (vcpu->arch.st.steal.version & 1)
vcpu->arch.st.steal.version += 1;  /* first time write, random 
junk */
 
@@ -2810,8 +2812,22 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 }
 
+static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
+{
+   if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   vcpu->arch.st.steal.preempted = 1;
+
+   kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal.preempted,
+   offsetof(struct kvm_steal_time, preempted),
+   sizeof(vcpu->arch.st.steal.preempted));
+}
+
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+   kvm_steal_time_set_preempted(vcpu);
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
-- 
2.4.11
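
One detail worth spelling out (a quick standalone layout check, not part
of the patch): the new __u8 preempted plus three bytes of padding replace
exactly one of the previous __u32 pad words, so the structure size and
the offsets of the pre-existing fields are unchanged and the guest/host
ABI stays compatible. The mirrored types below exist only to make the
check compile on its own.

#include <stddef.h>
#include <stdint.h>

struct kvm_steal_time_layout {	/* mirrors the uapi struct above */
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

_Static_assert(sizeof(struct kvm_steal_time_layout) == 64,
	       "size must stay 64 bytes");
_Static_assert(offsetof(struct kvm_steal_time_layout, preempted) == 16,
	       "preempted occupies the start of the old pad[12] area");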



[PATCH v7 09/11] x86, kernel/kvm.c: support vcpu preempted check

2016-11-01 Thread Pan Xinhui
Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

Since commit ("x86, kvm/x86.c: support vcpu preempted check"),
struct kvm_steal_time::preempted indicates whether a vcpu is running.

unix benchmark result:
host:  kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case   after-patch   before-patch
Execl Throughput   |18307.9 lps  |11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput| 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps
Process Creation   |29881.2 lps  |28572.8 lps
Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm
Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm
System Call Overhead   | 10385653.0 lps  | 10419979.0 lps

Signed-off-by: Pan Xinhui 
Acked-by: Paolo Bonzini 
---
 arch/x86/kernel/kvm.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0b48dd2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+static bool kvm_vcpu_is_preempted(int cpu)
+{
+   struct kvm_steal_time *src;
+
+   src = &per_cpu(steal_time, cpu);
+
+   return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -471,6 +480,9 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
has_steal_clock = 1;
pv_time_ops.steal_clock = kvm_steal_clock;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+   pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+#endif
}
 
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
-- 
2.4.11
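
To illustrate where the gain comes from (a hypothetical waiter, not code
from this series; lock_is_held is an assumed callback): a guest that
keeps spinning on a lock whose holder's vcpu has been scheduled out by
the host only burns its own timeslice, so once the preempted hint is
available it can back off instead of spinning.

static void wait_for_lock_owner(int owner_cpu, bool (*lock_is_held)(void))
{
	while (lock_is_held()) {
		if (vcpu_is_preempted(owner_cpu)) {
			/* the holder cannot run right now; don't spin on it */
			cond_resched();
			continue;
		}
		cpu_relax();	/* holder is running, brief busy-wait is fine */
	}
}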



[PATCH v7 10/11] x86, xen: support vcpu preempted check

2016-11-01 Thread Pan Xinhui
From: Juergen Gross 

Support the vcpu_is_preempted() functionality under Xen. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

A quick test (4 vcpus on 1 physical cpu doing a parallel build job
with "make -j 8") reduced system time by about 5% with this patch.

Signed-off-by: Juergen Gross 
Signed-off-by: Pan Xinhui 
---
 arch/x86/xen/spinlock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3d6e006..74756bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.queued_spin_unlock = 
PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_lock_ops.wait = xen_qlock_wait;
pv_lock_ops.kick = xen_qlock_kick;
+
+   pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
 }
 
 /*
-- 
2.4.11



[PATCH v7 11/11] Documentation: virtual: kvm: Support vcpu preempted check

2016-11-01 Thread Pan Xinhui
Commit ("x86, kvm: support vcpu preempted check") adds a field "__u8
preempted" to struct kvm_steal_time. This field tells whether a vcpu is
running or not.

It is zero if 1) an old KVM does not support this field, or 2) the vcpu
is not preempted. Any other value means the vcpu has been preempted.

Signed-off-by: Pan Xinhui 
Acked-by: Radim Krčmář 
Acked-by: Paolo Bonzini 
---
 Documentation/virtual/kvm/msr.txt | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt 
b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..ab2ab76 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,9 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
__u64 steal;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u8  preempted;
+   __u8  u8_pad[3];
+   __u32 pad[11];
}
 
whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +234,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
nanoseconds. Time during which the vcpu is idle, will not be
reported as steal time.
 
+   preempted: indicates whether the VCPU that owns this struct is
+   running or not. Non-zero values mean the VCPU has been preempted.
+   Zero means the VCPU is not preempted. NOTE: it is always zero if
+   the hypervisor doesn't support this field.
+
 MSR_KVM_EOI_EN: 0x4b564d04
data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
when disabled.  Bit 1 is reserved and must be zero.  When PV end of
-- 
2.4.11



[RFC] kexec_file: Add support for purgatory built as PIE

2016-11-01 Thread Thiago Jung Bauermann
Hello,

The kexec_file code currently builds the purgatory as a partially linked object 
(using ld -r). Is there a particular reason to use that instead of a position 
independent executable (PIE)?

I found a discussion from 2013 in the archives but from what I understood it 
was about the purgatory as a separate object vs having it linked into the 
kernel, which is different from what I'm asking:

http://lists.infradead.org/pipermail/kexec/2013-December/010535.html

Here is my motivation for this question:

 On ppc64 purgatory.ro has 12 relocation types when built as a partially 
linked object. This makes arch_kexec_apply_relocations_add duplicate a lot of 
code with module_64.c:apply_relocate_add to implement these relocations. The 
alternative is to do some refactoring so that both functions can share the 
implementation of the relocations. This is done in patches 5 and 6 of the 
kexec_file_load implementation for powerpc:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-October/149984.html

Michael Ellerman would prefer if module_64.c didn't need to be changed, and 
suggested that the purgatory could be a position independent executable. 
Indeed, in that case there are only 4 relocation types in purgatory.ro (which 
aren't even implemented in module_64.c:apply_relocate_add), so the relocation 
code for the purgatory can leave that file alone and have its own relocation 
implementation.

Also, the purgatory is an executable and not an intermediate compiler
output, so it makes sense conceptually to build it as a PIE rather than
as a partially linked object.

The patch below adds the support needed in kexec_file.c to allow powerpc-
specific code to load and relocate a purgatory binary built as PIE. This is WIP 
and can probably be refined a bit. Would you accept a change along these lines?
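
To make concrete what "load and relocate a purgatory binary built as
PIE" involves, here is a minimal standalone sketch (64-bit ELF assumed,
names hypothetical, no validation) of deriving the PT_LOAD footprint and
the relative entry point from the program headers; the patch below
additionally copies the sections, applies the chosen load slide to the
relocations and records the result in the kexec buffers.

#include <elf.h>
#include <stddef.h>

struct pie_layout {
	unsigned long vaddr_min;	/* lowest p_vaddr of any PT_LOAD    */
	unsigned long vaddr_max;	/* end of the highest PT_LOAD       */
	unsigned long entry_off;	/* e_entry relative to vaddr_min    */
};

static void pie_layout_from_phdrs(const Elf64_Ehdr *ehdr,
				  struct pie_layout *out)
{
	const Elf64_Phdr *phdrs = (const void *)ehdr + ehdr->e_phoff;
	unsigned long lo = ~0UL, hi = 0;
	int i;

	for (i = 0; i < ehdr->e_phnum; i++) {
		const Elf64_Phdr *p = &phdrs[i];

		if (p->p_type != PT_LOAD)
			continue;
		if (p->p_vaddr < lo)
			lo = p->p_vaddr;
		if (p->p_vaddr + p->p_memsz > hi)
			hi = p->p_vaddr + p->p_memsz;
	}

	out->vaddr_min = lo;
	out->vaddr_max = hi;
	out->entry_off = ehdr->e_entry - lo;
}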

Signed-off-by: Thiago Jung Bauermann 
---
 arch/Kconfig|   3 +
 kernel/kexec_file.c | 159 ++--
 kernel/kexec_internal.h |  26 
 3 files changed, 183 insertions(+), 5 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 659bdd079277..7fd6879be222 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -5,6 +5,9 @@
 config KEXEC_CORE
bool
 
+config HAVE_KEXEC_FILE_PIE_PURGATORY
+   bool
+
 config OPROFILE
tristate "OProfile system profiling"
depends on PROFILING
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 0c2df7f73792..dfc3e015160d 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -633,7 +633,149 @@ static int kexec_calculate_store_digests(struct kimage 
*image)
return ret;
 }
 
-/* Actually load purgatory. Lot of code taken from kexec-tools */
+#ifdef CONFIG_HAVE_KEXEC_FILE_PIE_PURGATORY
+/* Load PIE purgatory using the program header information. */
+static int __kexec_load_purgatory(struct kimage *image, unsigned long min,
+ unsigned long max, int top_down)
+{
+   struct purgatory_info *pi = &image->purgatory_info;
+   unsigned long first_offset;
+   unsigned long orig_load_addr = 0;
+   const void *src;
+   int i, ret;
+   const Elf_Phdr *phdrs = (const void *) pi->ehdr + pi->ehdr->e_phoff;
+   const Elf_Phdr *phdr;
+   const Elf_Shdr *sechdrs_c;
+   Elf_Shdr *sechdr;
+   Elf_Shdr *sechdrs = NULL;
+   struct kexec_buf kbuf = { .image = image, .bufsz = 0, .buf_align = 1,
+ .buf_min = min, .buf_max = max,
+ .top_down = top_down };
+
+   /*
+* sechdrs_c points to the section headers in the purgatory and is
+* read only. No modifications allowed.
+*/
+   sechdrs_c = (void *) pi->ehdr + pi->ehdr->e_shoff;
+
+   /*
+* We can not modify sechdrs_c[] and its fields. It is read only.
+* Copy it over to a local copy where one can store some temporary
+* data and free it at the end. We need to modify ->sh_addr and
+* ->sh_offset fields to keep track of permanent and temporary
+* locations of sections.
+*/
+   sechdrs = vzalloc(pi->ehdr->e_shnum * sizeof(Elf_Shdr));
+   if (!sechdrs)
+   return -ENOMEM;
+
+   memcpy(sechdrs, sechdrs_c, pi->ehdr->e_shnum * sizeof(Elf_Shdr));
+
+   /*
+* We seem to have multiple copies of the sections. The first copy is
+* the one embedded in the kernel, in a read-only section. Some of
+* these sections will be copied to a temporary buffer and relocated,
+* and those sections will finally be copied to their final
+* destination at segment load time.
+*
+* Use ->sh_offset to reflect section address in memory. It will
+* point to original read only copy if section is not allocatable.
+* Otherwise it will point to temporary copy which will be relocated.
+*
+* Use ->sh_addr to contain final address of the section where it

Re: [PATCH] powerpc/64s: relocation, register save fixes for system reset interrupt

2016-11-01 Thread Mahesh Jagannath Salgaonkar
On 10/13/2016 07:47 AM, Nicholas Piggin wrote:
> This patch does a couple of things. First of all, powernv immediately
> explodes when running a relocated kernel, because the system reset
> exception for handling sleeps does not do correct relocated branches.
> 
> Secondly, the sleep handling code trashes the condition and cfar
> registers, which we would like to preserve for debugging purposes (for
> non-sleep case exception).
> 
> This patch changes the exception to use the standard format that saves
> registers before any tests or branches are made. It adds the test for
> idle-wakeup as an "extra" to break out of the normal exception path.
> Then it branches to a relocated idle handler that calls the various
> idle handling functions.
> 
> After this patch, POWER8 CPU simulator now boots powernv kernel that is
> running at non-zero.
> 
> Cc: Balbir Singh 
> Cc: Shreyas B. Prabhu 
> Cc: Gautham R. Shenoy 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/exception-64s.h | 16 ++
>  arch/powerpc/kernel/exceptions-64s.S | 50 
> ++--
>  2 files changed, 45 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/exception-64s.h 
> b/arch/powerpc/include/asm/exception-64s.h
> index 2e4e7d8..84d49b1 100644
> --- a/arch/powerpc/include/asm/exception-64s.h
> +++ b/arch/powerpc/include/asm/exception-64s.h
> @@ -93,6 +93,10 @@
>   ld  reg,PACAKBASE(r13); /* get high part of &label */   \
>   ori reg,reg,(FIXED_SYMBOL_ABS_ADDR(label))@l;
> 
> +#define __LOAD_HANDLER(reg, label)   \
> + ld  reg,PACAKBASE(r13); \
> + ori reg,reg,(ABS_ADDR(label))@l;
> +
>  /* Exception register prefixes */
>  #define EXC_HV   H
>  #define EXC_STD
> @@ -208,6 +212,18 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
>  #define kvmppc_interrupt kvmppc_interrupt_pr
>  #endif
> 
> +#ifdef CONFIG_RELOCATABLE
> +#define BRANCH_TO_COMMON(reg, label) \
> + __LOAD_HANDLER(reg, label); \
> + mtctr   reg;\
> + bctr
> +
> +#else
> +#define BRANCH_TO_COMMON(reg, label) \
> + b   label
> +
> +#endif
> +
>  #define __KVM_HANDLER_PROLOG(area, n)
> \
>   BEGIN_FTR_SECTION_NESTED(947)   \
>   ld  r10,area+EX_CFAR(r13);  \
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 08992f8..e680e84 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -95,19 +95,35 @@ __start_interrupts:
>  /* No virt vectors corresponding with 0x0..0x100 */
>  EXC_VIRT_NONE(0x4000, 0x4100)
> 
> -EXC_REAL_BEGIN(system_reset, 0x100, 0x200)
> - SET_SCRATCH0(r13)
> +
>  #ifdef CONFIG_PPC_P7_NAP
> -BEGIN_FTR_SECTION
> - /* Running native on arch 2.06 or later, check if we are
> -  * waking up from nap/sleep/winkle.
> + /*
> +  * If running native on arch 2.06 or later, check if we are waking up
> +  * from nap/sleep/winkle, and branch to idle handler.
>*/
> - mfspr   r13,SPRN_SRR1
> - rlwinm. r13,r13,47-31,30,31
> - beq 9f
> +#define IDLETEST(n)  \
> + BEGIN_FTR_SECTION ; \
> + mfspr   r10,SPRN_SRR1 ; \
> + rlwinm. r10,r10,47-31,30,31 ;   \
> + beq-1f ;\
> + cmpwi   cr3,r10,2 ; \
> + BRANCH_TO_COMMON(r10, system_reset_idle_common) ;   \
> +1:   \
> + END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
> +#else
> +#define IDLETEST NOTEST
> +#endif
> 
> - cmpwi   cr3,r13,2
> - GET_PACA(r13)
> +EXC_REAL_BEGIN(system_reset, 0x100, 0x200)
> + SET_SCRATCH0(r13)
> + EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, system_reset_common, EXC_STD,
> +  IDLETEST, 0x100)

Very sorry for the late review. On arch 2.07 and earlier, if we wake up
from winkle then the last bit of HSPRG0 will be set to 1. Hence, before
we access the paca we need to fix it up by clearing that bit, which is
done in pnv_restore_hyp_resource(). But with this patch, we would end up
there only after going through EXCEPTION_PROLOG_PSERIES(). This macro
gets the paca using GET_PACA(r13), and all the EXCEPTION_PROLOG_* code
starts using/accessing r13/paca without fixing it up. Wouldn't this
break things badly on arch 2.07 and earlier? Am I missing anything?

Thanks,
-Mahesh.



Re: [PATCH] powerpc/64s: relocation, register save fixes for system reset interrupt

2016-11-01 Thread Nicholas Piggin
On Wed, 2 Nov 2016 11:34:59 +0530
Mahesh Jagannath Salgaonkar  wrote:

> On 10/13/2016 07:47 AM, Nicholas Piggin wrote:
> > This patch does a couple of things. First of all, powernv immediately
> > explodes when running a relocated kernel, because the system reset
> > exception for handling sleeps does not do correct relocated branches.
> > 
> > Secondly, the sleep handling code trashes the condition and cfar
> > registers, which we would like to preserve for debugging purposes (for
> > non-sleep case exception).
> > 
> > This patch changes the exception to use the standard format that saves
> > registers before any tests or branches are made. It adds the test for
> > idle-wakeup as an "extra" to break out of the normal exception path.
> > Then it branches to a relocated idle handler that calls the various
> > idle handling functions.
> > 
> > After this patch, POWER8 CPU simulator now boots powernv kernel that is
> > running at non-zero.
> > 
> > Cc: Balbir Singh 
> > Cc: Shreyas B. Prabhu 
> > Cc: Gautham R. Shenoy 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/include/asm/exception-64s.h | 16 ++
> >  arch/powerpc/kernel/exceptions-64s.S | 50 
> > ++--
> >  2 files changed, 45 insertions(+), 21 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/exception-64s.h 
> > b/arch/powerpc/include/asm/exception-64s.h
> > index 2e4e7d8..84d49b1 100644
> > --- a/arch/powerpc/include/asm/exception-64s.h
> > +++ b/arch/powerpc/include/asm/exception-64s.h
> > @@ -93,6 +93,10 @@
> > ld  reg,PACAKBASE(r13); /* get high part of &label */   \
> > ori reg,reg,(FIXED_SYMBOL_ABS_ADDR(label))@l;
> > 
> > +#define __LOAD_HANDLER(reg, label) \
> > +   ld  reg,PACAKBASE(r13); \
> > +   ori reg,reg,(ABS_ADDR(label))@l;
> > +
> >  /* Exception register prefixes */
> >  #define EXC_HV H
> >  #define EXC_STD
> > @@ -208,6 +212,18 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
> >  #define kvmppc_interrupt kvmppc_interrupt_pr
> >  #endif
> > 
> > +#ifdef CONFIG_RELOCATABLE
> > +#define BRANCH_TO_COMMON(reg, label)   
> > \
> > +   __LOAD_HANDLER(reg, label); \
> > +   mtctr   reg;\
> > +   bctr
> > +
> > +#else
> > +#define BRANCH_TO_COMMON(reg, label)   
> > \
> > +   b   label
> > +
> > +#endif
> > +
> >  #define __KVM_HANDLER_PROLOG(area, n)  
> > \
> > BEGIN_FTR_SECTION_NESTED(947)   \
> > ld  r10,area+EX_CFAR(r13);  \
> > diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> > b/arch/powerpc/kernel/exceptions-64s.S
> > index 08992f8..e680e84 100644
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -95,19 +95,35 @@ __start_interrupts:
> >  /* No virt vectors corresponding with 0x0..0x100 */
> >  EXC_VIRT_NONE(0x4000, 0x4100)
> > 
> > -EXC_REAL_BEGIN(system_reset, 0x100, 0x200)
> > -   SET_SCRATCH0(r13)
> > +
> >  #ifdef CONFIG_PPC_P7_NAP
> > -BEGIN_FTR_SECTION
> > -   /* Running native on arch 2.06 or later, check if we are
> > -* waking up from nap/sleep/winkle.
> > +   /*
> > +* If running native on arch 2.06 or later, check if we are waking up
> > +* from nap/sleep/winkle, and branch to idle handler.
> >  */
> > -   mfspr   r13,SPRN_SRR1
> > -   rlwinm. r13,r13,47-31,30,31
> > -   beq 9f
> > +#define IDLETEST(n)
> > \
> > +   BEGIN_FTR_SECTION ; \
> > +   mfspr   r10,SPRN_SRR1 ; \
> > +   rlwinm. r10,r10,47-31,30,31 ;   \
> > +   beq-1f ;\
> > +   cmpwi   cr3,r10,2 ; \
> > +   BRANCH_TO_COMMON(r10, system_reset_idle_common) ;   \
> > +1: \
> > +   END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
> > +#else
> > +#define IDLETEST NOTEST
> > +#endif
> > 
> > -   cmpwi   cr3,r13,2
> > -   GET_PACA(r13)
> > +EXC_REAL_BEGIN(system_reset, 0x100, 0x200)
> > +   SET_SCRATCH0(r13)
> > +   EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, system_reset_common, EXC_STD,
> > +IDLETEST, 0x100)  
> 
> Very sorry for the late review. On arch 2.07 and earlier, if we wake up
> from winkle then the last bit of HSPRG0 will be set to 1. Hence, before
> we access the paca we need to fix it up by clearing that bit, which is
> done in pnv_restore_hyp_resource(). But with this patch, we would end up
> there only after going through EXCEPTION_PROLOG_PSERIES(). This macro gets the paca
> using GET_PACA(r13) and all the E