Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-06-24 Thread Mike Qiu

On 06/23/2014 10:14 AM, Gavin Shan wrote:

The patch implements one OPAL firmware sysfs file to support PCI error
injection: /sys/firmware/opal/errinjct, which will be used like the
way described as follows.

According to PAPR spec, there are 3 RTAS calls related to error injection:
ibm,open-errinjct: allocate token prior to doing error injection.
ibm,close-errinjct: release the token allocated from ibm,open-errinjct.
ibm,errinjct: do error injection.

Sysfs file /sys/firmware/opal/errinjct accepts strings that have fixed
format ei_token  For now, we only support 32-bits and 64-bits
PCI error injection and they should have following strings written to
/sys/firmware/opal/errinjct as follows. We don't have corresponding
sysfs files for ibm,open-errinjct and ibm,close-errinjct, which
means that we rely on userland to maintain the token by itself.

32-bits PCI error: 7:addr:mask:iommu_group_id:function.
64-bits PCI error: 8:addr:mask:iommu_group_id:function.

The above 7 and 8 represent 32-bits and 64-bits PCI error seperately
and function is one of the specific PCI errors (e.g. MMIO access address
parity error), which are defined by PAPR spec.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
  arch/powerpc/include/asm/opal.h|   1 +
  arch/powerpc/platforms/powernv/Makefile|   2 +-
  arch/powerpc/platforms/powernv/opal-errinjct.c | 184 +
  arch/powerpc/platforms/powernv/opal.c  |   2 +
  4 files changed, 188 insertions(+), 1 deletion(-)
  create mode 100644 arch/powerpc/platforms/powernv/opal-errinjct.c

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index d982bb8..bf280d9 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -985,6 +985,7 @@ extern int opal_elog_init(void);
  extern void opal_platform_dump_init(void);
  extern void opal_sys_param_init(void);
  extern void opal_msglog_init(void);
+extern void opal_errinjct_init(void);

  extern int opal_machine_check(struct pt_regs *regs);
  extern bool opal_mce_check_early_recovery(struct pt_regs *regs);
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index 63cebb9..4711de8 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -1,7 +1,7 @@
  obj-y += setup.o opal-takeover.o opal-wrappers.o opal.o 
opal-async.o
  obj-y += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
  obj-y += rng.o opal-elog.o opal-dump.o opal-sysparam.o 
opal-sensor.o
-obj-y  += opal-msglog.o
+obj-y  += opal-msglog.o opal-errinjct.o

  obj-$(CONFIG_SMP) += smp.o
  obj-$(CONFIG_PCI) += pci.o pci-p5ioc2.o pci-ioda.o
diff --git a/arch/powerpc/platforms/powernv/opal-errinjct.c 
b/arch/powerpc/platforms/powernv/opal-errinjct.c
new file mode 100644
index 000..29c9e83
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-errinjct.c
@@ -0,0 +1,184 @@
+/*
+ * The file supports error injection, which works based on OPAL API.
+ * For now, we only support PCI error injection. We need support
+ * injecting other types of errors in future.
+ *
+ * Copyright Gavin Shan, IBM Corporation 2014.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include linux/kernel.h
+#include linux/init.h
+#include linux/msi.h
+#include linux/pci.h
+#include linux/iommu.h
+#include linux/random.h
+#include linux/slab.h
+#include linux/sysfs.h
+#include linux/fs.h
+#include linux/fcntl.h
+#include linux/kobject.h
+
+#include asm/msi_bitmap.h
+#include asm/iommu.h
+#include asm/pci-bridge.h
+#include asm/ppc-pci.h
+#include asm/opal.h
+
+#include powernv.h
+#include pci.h
+
+static DEFINE_MUTEX(errinjct_mutex);
+
+static int errinjct_iommu_group_to_phb_and_pe(uint32_t iommu_grp_id,
+ uint64_t *phb_id,
+ uint32_t *pe_num)
+{
+#ifdef CONFIG_IOMMU_API


Is it reasonable to do error injection with CONFIG_IOMMU_API ?

That means if use default config(CONFIG_IOMMU_API = n),  we can not do 
error injection to pci devices?


Thanks
Mike

+   struct iommu_group *iommu_grp;
+   struct iommu_table *tbl;
+   struct pnv_ioda_pe *pe;
+
+   iommu_grp = iommu_group_get_by_id(iommu_grp_id);
+   if (!iommu_grp)
+   return -ENODEV;
+
+   tbl = iommu_group_get_iommudata(iommu_grp);
+   if (!tbl)
+   return -ENODEV;
+
+   pe = container_of(tbl, struct pnv_ioda_pe, tce32_table);
+   if (!pe-phb)
+   return -ENODEV;
+
+   *phb_id = pe-phb-opal_id;
+   *pe_num = pe-pe_number;
+
+   return 0;
+#endif
+
+   return -ENXIO;
+}
+
+static int 

Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-06-24 Thread Benjamin Herrenschmidt

 Is it reasonable to do error injection with CONFIG_IOMMU_API ?
 
 That means if use default config(CONFIG_IOMMU_API = n),  we can not do 
 error injection to pci devices?

Well we can't pass them through either so ...

In any case, this is not a priority. First we need to implement a solid
error injection facility for the *host*. The guest one is really really
low on the list.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-06-24 Thread Mike Qiu

On 06/24/2014 02:36 PM, Benjamin Herrenschmidt wrote:

Is it reasonable to do error injection with CONFIG_IOMMU_API ?

That means if use default config(CONFIG_IOMMU_API = n),  we can not do
error injection to pci devices?

Well we can't pass them through either so ...
In any case, this is not a priority. First we need to implement a solid
error injection facility for the *host*. The guest one is really really


OK.

Is that mean *host* side error injection should base on 
CONFIG_IOMMU_API ? If it is just host side(no guest, no pass through), 
can't we do error inject?


Maybe I misunderstand :)

Thanks
Mike

low on the list.

Cheers,
Ben.





--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-06-24 Thread Benjamin Herrenschmidt
On Tue, 2014-06-24 at 14:57 +0800, Mike Qiu wrote:
 Is that mean *host* side error injection should base on 
 CONFIG_IOMMU_API ? If it is just host side(no guest, no pass through), 
 can't we do error inject?
 
 Maybe I misunderstand :)

Ah no, make different patches, we don't want to use IOMMU group ID, just
PE numbers. Maybe we should expose in sysfs the PEs from the platform
code with the error injection files underneath ... 

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 -next 0/9] CMA: generalize CMA reserved area management code

2014-06-24 Thread Joonsoo Kim
On Wed, Jun 18, 2014 at 01:51:44PM -0700, Andrew Morton wrote:
 On Tue, 17 Jun 2014 10:25:07 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote:
 
   v2:
  - Although this patchset looks very different with v1, the end result,
  that is, mm/cma.c is same with v1's one. So I carry Ack to patch 6-7.
   
   This patchset is based on linux-next 20140610.
   
   Thanks for taking care of this. I will test it with my setup and if
   everything goes well, I will take it to my -next tree. If any branch
   is required for anyone to continue his works on top of those patches,
   let me know, I will also prepare it.
  
  Hello,
  
  I'm glad to hear that. :)
  But, there is one concern. As you already know, I am preparing further
  patches (Aggressively allocate the pages on CMA reserved memory). It
  may be highly related to MM branch and also slightly depends on this CMA
  changes. In this case, what is the best strategy to merge this
  patchset? IMHO, Anrew's tree is more appropriate branch. If there is
  no issue in this case, I am willing to develope further patches based
  on your tree.
 
 That's probably easier.  Marek, I'll merge these into -mm (and hence
 -next and git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
 and shall hold them pending you review/ack/test/etc, OK?

Hello, Marek.

Could you share your decision about this patchset?

Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule

2014-06-24 Thread Alexander Graf


On 18.06.14 09:15, Mihai Caraman wrote:

On vcpu schedule, the condition checked for tlb pollution is too loose.
The tlb entries of a vcpu become polluted (vs stale) only when a different
vcpu within the same logical partition runs in-between. Optimize the tlb
invalidation condition keeping last_vcpu per logical partition id.

With the new invalidation condition, a guest shows 4% performance improvement
on P5020DS while running a memory stress application with the cpu 
oversubscribed,
the other guest running a cpu intensive workload.

Guest - old invalidation condition
   real 3.89
   user 3.87
   sys 0.01

Guest - enhanced invalidation condition
   real 3.75
   user 3.73
   sys 0.01

Host
   real 3.70
   user 1.85
   sys 0.00

The memory stress application accesses 4KB pages backed by 75% of available
TLB0 entries:

char foo[ENTRIES][4096] __attribute__ ((aligned (4096)));

int main()
{
char bar;
int i, j;

for (i = 0; i  ITERATIONS; i++)
for (j = 0; j  ENTRIES; j++)
bar = foo[j][0];

return 0;
}

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
Cc: Scott Wood scottw...@freescale.com


Thanks, applied to kvm-ppc-queue.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 6/7] KVM: PPC: Book3S HV: Fix ABIv2 on LE

2014-06-24 Thread Alexander Graf
For code that doesn't live in modules we can just branch to the real function
names, giving us compatibility with ABIv1 and ABIv2.

Do this for the compiled-in code of HV KVM.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 1a71f60..a9de981 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -668,9 +668,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_TM)
 
mr  r31, r4
addir3, r31, VCPU_FPRS_TM
-   bl  .load_fp_state
+   bl  load_fp_state
addir3, r31, VCPU_VRS_TM
-   bl  .load_vr_state
+   bl  load_vr_state
mr  r4, r31
lwz r7, VCPU_VRSAVE_TM(r4)
mtspr   SPRN_VRSAVE, r7
@@ -1414,9 +1414,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_TM)
 
/* Save FP/VSX. */
addir3, r9, VCPU_FPRS_TM
-   bl  .store_fp_state
+   bl  store_fp_state
addir3, r9, VCPU_VRS_TM
-   bl  .store_vr_state
+   bl  store_vr_state
mfspr   r6, SPRN_VRSAVE
stw r6, VCPU_VRSAVE_TM(r9)
 1:
@@ -2405,11 +2405,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
mtmsrd  r8
isync
addir3,r3,VCPU_FPRS
-   bl  .store_fp_state
+   bl  store_fp_state
 #ifdef CONFIG_ALTIVEC
 BEGIN_FTR_SECTION
addir3,r31,VCPU_VRS
-   bl  .store_vr_state
+   bl  store_vr_state
 END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
 #endif
mfspr   r6,SPRN_VRSAVE
@@ -2441,11 +2441,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
mtmsrd  r8
isync
addir3,r4,VCPU_FPRS
-   bl  .load_fp_state
+   bl  load_fp_state
 #ifdef CONFIG_ALTIVEC
 BEGIN_FTR_SECTION
addir3,r31,VCPU_VRS
-   bl  .load_vr_state
+   bl  load_vr_state
 END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
 #endif
lwz r7,VCPU_VRSAVE(r31)
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Book3E: Unlock mmu_lock when setting caching atttribute

2014-06-24 Thread Alexander Graf


On 18.06.14 17:45, Mihai Caraman wrote:

The patch 08c9a188d0d0fc0f0c5e17d89a06bb59c493110f
kvm: powerpc: use caching attributes as per linux pte
do not handle properly the error case, letting mmu_lock locked. The lock
will further generate a RCU stall from kvmppc_e500_emul_tlbwe() caller.

In case of an error go to out label.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
Cc: Bharat Bhushan bharat.bhus...@freescale.com


Thanks, applied to for-3.16.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:
 Howdy,
 
 Ben reminded me a while back that we have a nasty race in our KVM PV code.
 
 We replace a few instructions with longer streams of instructions to check
 whether it's necessary to trap out from it (like mtmsr, no need to trap if
 we only disable interrupts). During those replacement chunks we must not get
 any interrupts, because they might overwrite scratch space that we already
 used to save otherwise clobbered register state into.
 
 So we have a thing called critical sections which allows us to atomically
 get in and out of interrupt disabled modes without touching MSR. When we
 are supposed to deliver an interrupt into the guest while we are in a critical
 section, we just don't inject the interrupt yet, but leave it be until the
 next trap.
 
 However, we never really know when the next trap would be. For all we know it
 could be never. At this point we created a race that is a potential source
 for interrupt loss or at least deferral.
 
 This patch set aims at solving the race. Instead of merely deferring an
 interrupt when we see such a situation, we go into a special instruction
 interpretation mode. In this mode, we interpret all PPC assembler instructions
 that happen until we are out of the critical section again, at which point
 we can now inject the interrupt.
 
 This bug only affects KVM implementations that make use of the magic page, so
 e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Alexander Graf


On 24.06.14 20:53, Scott Wood wrote:

On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:

Howdy,

Ben reminded me a while back that we have a nasty race in our KVM PV code.

We replace a few instructions with longer streams of instructions to check
whether it's necessary to trap out from it (like mtmsr, no need to trap if
we only disable interrupts). During those replacement chunks we must not get
any interrupts, because they might overwrite scratch space that we already
used to save otherwise clobbered register state into.

So we have a thing called critical sections which allows us to atomically
get in and out of interrupt disabled modes without touching MSR. When we
are supposed to deliver an interrupt into the guest while we are in a critical
section, we just don't inject the interrupt yet, but leave it be until the
next trap.

However, we never really know when the next trap would be. For all we know it
could be never. At this point we created a race that is a potential source
for interrupt loss or at least deferral.

This patch set aims at solving the race. Instead of merely deferring an
interrupt when we see such a situation, we go into a special instruction
interpretation mode. In this mode, we interpret all PPC assembler instructions
that happen until we are out of the critical section again, at which point
we can now inject the interrupt.

This bug only affects KVM implementations that make use of the magic page, so
e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?


There are a few other alternatives to this implementation:

  1) Unmap the magic page, emulate all memory access to it while in 
critical and irq pending
  2) Trigger a timer that sends a request to the vcpu to wake it from 
potential sleep and inject the irq

  3) Single step until we're beyond the critical section
  4) Probably more that I can't think of right now :)

Each has their good and bad sides. Unmapping the magic page adds 
complexity to the MMU mapping code, since we need to make sure we don't 
map it back in on demand and treat faults to it specially.


The timer interrupt works, but I'm not fully convinced that it's a good 
idea for things like MC events which we also block during critical 
sections on e500v2.


Single stepping is hard enough to get right on interaction between QEMU, 
KVM and the guest. I didn't really want to make that stuff any more 
complicated.


This approach is really just one out of many - and it's one that's 
nicely self-contained and shouldn't have any impact at all on 
implementations that don't care about it ;).



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:
 On 24.06.14 20:53, Scott Wood wrote:
  On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:
  Howdy,
 
  Ben reminded me a while back that we have a nasty race in our KVM PV code.
 
  We replace a few instructions with longer streams of instructions to check
  whether it's necessary to trap out from it (like mtmsr, no need to trap if
  we only disable interrupts). During those replacement chunks we must not 
  get
  any interrupts, because they might overwrite scratch space that we already
  used to save otherwise clobbered register state into.
 
  So we have a thing called critical sections which allows us to atomically
  get in and out of interrupt disabled modes without touching MSR. When we
  are supposed to deliver an interrupt into the guest while we are in a 
  critical
  section, we just don't inject the interrupt yet, but leave it be until the
  next trap.
 
  However, we never really know when the next trap would be. For all we know 
  it
  could be never. At this point we created a race that is a potential source
  for interrupt loss or at least deferral.
 
  This patch set aims at solving the race. Instead of merely deferring an
  interrupt when we see such a situation, we go into a special instruction
  interpretation mode. In this mode, we interpret all PPC assembler 
  instructions
  that happen until we are out of the critical section again, at which point
  we can now inject the interrupt.
 
  This bug only affects KVM implementations that make use of the magic page, 
  so
  e500v2, book3s_32 and book3s_64 PR KVM.
  Would it be possible to single step through the critical section
  instead?  Or set a high res timer to expire very quickly?
 
 There are a few other alternatives to this implementation:
 
1) Unmap the magic page, emulate all memory access to it while in 
 critical and irq pending
2) Trigger a timer that sends a request to the vcpu to wake it from 
 potential sleep and inject the irq
3) Single step until we're beyond the critical section
4) Probably more that I can't think of right now :)
 
 Each has their good and bad sides. Unmapping the magic page adds 
 complexity to the MMU mapping code, since we need to make sure we don't 
 map it back in on demand and treat faults to it specially.
 
 The timer interrupt works, but I'm not fully convinced that it's a good 
 idea for things like MC events which we also block during critical 
 sections on e500v2.

Are you concerned about the guest seeing machine checks that are (more)
asynchronous with the error condition?  e500v2 machine checks are always
asynchronous.  From the core manual:

Machine check interrupts are typically caused by a hardware or
memory subsystem failure or by an attempt to access an invalid
address. They may be caused indirectly by execution of an
instruction, but may not be recognized or reported until long
after the processor has executed past the instruction that
caused the machine check. As such, machine check interrupts are
not thought of as synchronous or asynchronous nor as precise or
imprecise.

I don't think the lag would be a problem, and certainly it's better than
the current situation.

 Single stepping is hard enough to get right on interaction between QEMU, 
 KVM and the guest. I didn't really want to make that stuff any more 
 complicated.

I'm not sure that it would add much complexity.  We'd just need to check
whether any source other than the magic page turned wants DCBR0_IC on,
to determine whether to exit to userspace or not.

 This approach is really just one out of many - and it's one that's 
 nicely self-contained and shouldn't have any impact at all on 
 implementations that don't care about it ;).

Nicely self-contained is not a phrase I'd associate with 33 patches,
including a bunch of new emulation that probably isn't getting great
test coverage.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Alexander Graf


On 25.06.14 01:15, Scott Wood wrote:

On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:

On 24.06.14 20:53, Scott Wood wrote:

On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:

Howdy,

Ben reminded me a while back that we have a nasty race in our KVM PV code.

We replace a few instructions with longer streams of instructions to check
whether it's necessary to trap out from it (like mtmsr, no need to trap if
we only disable interrupts). During those replacement chunks we must not get
any interrupts, because they might overwrite scratch space that we already
used to save otherwise clobbered register state into.

So we have a thing called critical sections which allows us to atomically
get in and out of interrupt disabled modes without touching MSR. When we
are supposed to deliver an interrupt into the guest while we are in a critical
section, we just don't inject the interrupt yet, but leave it be until the
next trap.

However, we never really know when the next trap would be. For all we know it
could be never. At this point we created a race that is a potential source
for interrupt loss or at least deferral.

This patch set aims at solving the race. Instead of merely deferring an
interrupt when we see such a situation, we go into a special instruction
interpretation mode. In this mode, we interpret all PPC assembler instructions
that happen until we are out of the critical section again, at which point
we can now inject the interrupt.

This bug only affects KVM implementations that make use of the magic page, so
e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?

There are a few other alternatives to this implementation:

1) Unmap the magic page, emulate all memory access to it while in
critical and irq pending
2) Trigger a timer that sends a request to the vcpu to wake it from
potential sleep and inject the irq
3) Single step until we're beyond the critical section
4) Probably more that I can't think of right now :)

Each has their good and bad sides. Unmapping the magic page adds
complexity to the MMU mapping code, since we need to make sure we don't
map it back in on demand and treat faults to it specially.

The timer interrupt works, but I'm not fully convinced that it's a good
idea for things like MC events which we also block during critical
sections on e500v2.

Are you concerned about the guest seeing machine checks that are (more)
asynchronous with the error condition?  e500v2 machine checks are always
asynchronous.  From the core manual:

 Machine check interrupts are typically caused by a hardware or
 memory subsystem failure or by an attempt to access an invalid
 address. They may be caused indirectly by execution of an
 instruction, but may not be recognized or reported until long
 after the processor has executed past the instruction that
 caused the machine check. As such, machine check interrupts are
 not thought of as synchronous or asynchronous nor as precise or
 imprecise.

I don't think the lag would be a problem, and certainly it's better than
the current situation.


So what value would you set the timer to? If the value is too small, we 
never finish the critical section. If it's too big, we add lots of jitter.





Single stepping is hard enough to get right on interaction between QEMU,
KVM and the guest. I didn't really want to make that stuff any more
complicated.

I'm not sure that it would add much complexity.  We'd just need to check
whether any source other than the magic page turned wants DCBR0_IC on,
to determine whether to exit to userspace or not.


What if the guest is single stepping itself? How do we determine when to 
unset the bit again? When we get out of the critical section? How do we 
know what the value was before we set it?





This approach is really just one out of many - and it's one that's
nicely self-contained and shouldn't have any impact at all on
implementations that don't care about it ;).

Nicely self-contained is not a phrase I'd associate with 33 patches,
including a bunch of new emulation that probably isn't getting great
test coverage.


It means that there's only a single entry point for when the code gets 
executed, not that it's very little code.


Eventually this emulation code should get merged with the already 
existing in-kernel emulation code. Paul had already started work to 
merge the emulators a while ago. He even measured speedups when he sent 
all real mode and split real mode code via the interpreter rather than 
the entry/exit dance we do today.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-06-24 Thread Gavin Shan
On Tue, Jun 24, 2014 at 05:00:52PM +1000, Benjamin Herrenschmidt wrote:
On Tue, 2014-06-24 at 14:57 +0800, Mike Qiu wrote:
 Is that mean *host* side error injection should base on 
 CONFIG_IOMMU_API ? If it is just host side(no guest, no pass through), 
 can't we do error inject?
 
 Maybe I misunderstand :)

Ah no, make different patches, we don't want to use IOMMU group ID, just
PE numbers. Maybe we should expose in sysfs the PEs from the platform
code with the error injection files underneath ... 


Yeah, errinjct needs grab PCI_domain_nr+PE number from sysfs. We
already had PE number sysfs file:

[root@ltcfbl8eb :01:00.1]# pwd
/sys/bus/pci/devices/:01:00.1
[root@ltcfbl8eb :01:00.1]# cat eeh_pe_config_addr 
0x1

For guest support, we will rely on VFIO group ioctl command, which
naturally depends on pass-through.

---

We probably implement it like this. If there're anything wrong, please
correct me:

- Introduce EEH callback struct eeh_ops::err_inject(), which will be
  implemented for PowerNV (NULL for pSeries) by calling the PCI error
  injection dedicated OPAL API (opal_pci_err_inject()).
- Introduce global function eeh.c::eeh_err_inject(), which calls to
  eeh_ops::err_inject() and newly introduced VFIO EEH operation
  will be implemented based on this function.
- Introduce debugfs /sys/kernel/debug/powerpc/PCI/errinjct, which
  receives PCI error injection parameters from errinjct. It could
  have format: ei_token:addr:mask:PCI_domain_nr:PE_num:function.
  Eventually, eeh_err_inject() is invoked to call the corresponding
  OPAL API.

Thanks,
Gavin

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-06-24 Thread Gavin Shan
On Mon, Jun 23, 2014 at 04:36:44PM +1000, Michael Neuling wrote:
On Mon, 2014-06-23 at 12:14 +1000, Gavin Shan wrote:
 The patch implements one OPAL firmware sysfs file to support PCI error
 injection: /sys/firmware/opal/errinjct, which will be used like the
 way described as follows.
 
 According to PAPR spec, there are 3 RTAS calls related to error injection:
 ibm,open-errinjct: allocate token prior to doing error injection.
 ibm,close-errinjct: release the token allocated from ibm,open-errinjct.
 ibm,errinjct: do error injection.
 
 Sysfs file /sys/firmware/opal/errinjct accepts strings that have fixed
 format ei_token  For now, we only support 32-bits and 64-bits
 PCI error injection and they should have following strings written to
 /sys/firmware/opal/errinjct as follows. We don't have corresponding
 sysfs files for ibm,open-errinjct and ibm,close-errinjct, which
 means that we rely on userland to maintain the token by itself.

This sounds cool.  

Can you document the sysfs interface in Documentation/powerpc?


Yeah, Documentation/powerpc/eeh-pci-error-recovery.txt needs update
as Ben suggested. It's something in my list :-)

Thanks,
Gavin

 
 32-bits PCI error: 7:addr:mask:iommu_group_id:function.
 64-bits PCI error: 8:addr:mask:iommu_group_id:function.
 
 The above 7 and 8 represent 32-bits and 64-bits PCI error seperately
 and function is one of the specific PCI errors (e.g. MMIO access address
 parity error), which are defined by PAPR spec.
 
 Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
 ---
  arch/powerpc/include/asm/opal.h|   1 +
  arch/powerpc/platforms/powernv/Makefile|   2 +-
  arch/powerpc/platforms/powernv/opal-errinjct.c | 184 
 +
  arch/powerpc/platforms/powernv/opal.c  |   2 +
  4 files changed, 188 insertions(+), 1 deletion(-)
  create mode 100644 arch/powerpc/platforms/powernv/opal-errinjct.c
 
 diff --git a/arch/powerpc/include/asm/opal.h 
 b/arch/powerpc/include/asm/opal.h
 index d982bb8..bf280d9 100644
 --- a/arch/powerpc/include/asm/opal.h
 +++ b/arch/powerpc/include/asm/opal.h
 @@ -985,6 +985,7 @@ extern int opal_elog_init(void);
  extern void opal_platform_dump_init(void);
  extern void opal_sys_param_init(void);
  extern void opal_msglog_init(void);
 +extern void opal_errinjct_init(void);
  
  extern int opal_machine_check(struct pt_regs *regs);
  extern bool opal_mce_check_early_recovery(struct pt_regs *regs);
 diff --git a/arch/powerpc/platforms/powernv/Makefile 
 b/arch/powerpc/platforms/powernv/Makefile
 index 63cebb9..4711de8 100644
 --- a/arch/powerpc/platforms/powernv/Makefile
 +++ b/arch/powerpc/platforms/powernv/Makefile
 @@ -1,7 +1,7 @@
  obj-y   += setup.o opal-takeover.o opal-wrappers.o 
 opal.o opal-async.o
  obj-y   += opal-rtc.o opal-nvram.o opal-lpc.o 
 opal-flash.o
  obj-y   += rng.o opal-elog.o opal-dump.o 
 opal-sysparam.o opal-sensor.o
 -obj-y   += opal-msglog.o
 +obj-y   += opal-msglog.o opal-errinjct.o
  
  obj-$(CONFIG_SMP)   += smp.o
  obj-$(CONFIG_PCI)   += pci.o pci-p5ioc2.o pci-ioda.o
 diff --git a/arch/powerpc/platforms/powernv/opal-errinjct.c 
 b/arch/powerpc/platforms/powernv/opal-errinjct.c
 new file mode 100644
 index 000..29c9e83
 --- /dev/null
 +++ b/arch/powerpc/platforms/powernv/opal-errinjct.c
 @@ -0,0 +1,184 @@
 +/*
 + * The file supports error injection, which works based on OPAL API.
 + * For now, we only support PCI error injection. We need support
 + * injecting other types of errors in future.
 + *
 + * Copyright Gavin Shan, IBM Corporation 2014.
 + *
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of the GNU General Public License as published by
 + * the Free Software Foundation; either version 2 of the License, or
 + * (at your option) any later version.
 + */
 +
 +#include linux/kernel.h
 +#include linux/init.h
 +#include linux/msi.h
 +#include linux/pci.h
 +#include linux/iommu.h
 +#include linux/random.h
 +#include linux/slab.h
 +#include linux/sysfs.h
 +#include linux/fs.h
 +#include linux/fcntl.h
 +#include linux/kobject.h
 +
 +#include asm/msi_bitmap.h
 +#include asm/iommu.h
 +#include asm/pci-bridge.h
 +#include asm/ppc-pci.h
 +#include asm/opal.h
 +
 +#include powernv.h
 +#include pci.h
 +
 +static DEFINE_MUTEX(errinjct_mutex);
 +
 +static int errinjct_iommu_group_to_phb_and_pe(uint32_t iommu_grp_id,
 +  uint64_t *phb_id,
 +  uint32_t *pe_num)
 +{
 +#ifdef CONFIG_IOMMU_API
 +struct iommu_group *iommu_grp;
 +struct iommu_table *tbl;
 +struct pnv_ioda_pe *pe;
 +
 +iommu_grp = iommu_group_get_by_id(iommu_grp_id);
 +if (!iommu_grp)
 +return -ENODEV;
 +
 +tbl = iommu_group_get_iommudata(iommu_grp);
 +if (!tbl)
 +return -ENODEV;
 +
 +pe = container_of(tbl, struct pnv_ioda_pe, 

Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Wed, 2014-06-25 at 01:40 +0200, Alexander Graf wrote:
 On 25.06.14 01:15, Scott Wood wrote:
  On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:
  On 24.06.14 20:53, Scott Wood wrote:
  The timer interrupt works, but I'm not fully convinced that it's a good
  idea for things like MC events which we also block during critical
  sections on e500v2.
  Are you concerned about the guest seeing machine checks that are (more)
  asynchronous with the error condition?  e500v2 machine checks are always
  asynchronous.  From the core manual:
 
   Machine check interrupts are typically caused by a hardware or
   memory subsystem failure or by an attempt to access an invalid
   address. They may be caused indirectly by execution of an
   instruction, but may not be recognized or reported until long
   after the processor has executed past the instruction that
   caused the machine check. As such, machine check interrupts are
   not thought of as synchronous or asynchronous nor as precise or
   imprecise.
 
  I don't think the lag would be a problem, and certainly it's better than
  the current situation.
 
 So what value would you set the timer to? If the value is too small, we 
 never finish the critical section. If it's too big, we add lots of jitter.

Maybe something like 100us?

Single stepping would be better, though.

  Single stepping is hard enough to get right on interaction between QEMU,
  KVM and the guest. I didn't really want to make that stuff any more
  complicated.
  I'm not sure that it would add much complexity.  We'd just need to check
  whether any source other than the magic page turned wants DCBR0_IC on,
  to determine whether to exit to userspace or not.
 
 What if the guest is single stepping itself? How do we determine when to 
 unset the bit again? When we get out of the critical section? How do we 
 know what the value was before we set it?

Keep track of each requester of single stepping separately, and only
ever set the real bit by ORing them.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-06-24 Thread Mike Qiu

On 06/25/2014 08:03 AM, Gavin Shan wrote:

On Tue, Jun 24, 2014 at 05:00:52PM +1000, Benjamin Herrenschmidt wrote:

On Tue, 2014-06-24 at 14:57 +0800, Mike Qiu wrote:

Is that mean *host* side error injection should base on
CONFIG_IOMMU_API ? If it is just host side(no guest, no pass through),
can't we do error inject?

Maybe I misunderstand :)

Ah no, make different patches, we don't want to use IOMMU group ID, just
PE numbers. Maybe we should expose in sysfs the PEs from the platform
code with the error injection files underneath ...


Yeah, errinjct needs grab PCI_domain_nr+PE number from sysfs. We
already had PE number sysfs file:

[root@ltcfbl8eb :01:00.1]# pwd
/sys/bus/pci/devices/:01:00.1
[root@ltcfbl8eb :01:00.1]# cat eeh_pe_config_addr
0x1

For guest support, we will rely on VFIO group ioctl command, which
naturally depends on pass-through.

---

We probably implement it like this. If there're anything wrong, please
correct me:

- Introduce EEH callback struct eeh_ops::err_inject(), which will be
   implemented for PowerNV (NULL for pSeries) by calling the PCI error
   injection dedicated OPAL API (opal_pci_err_inject()).
- Introduce global function eeh.c::eeh_err_inject(), which calls to
   eeh_ops::err_inject() and newly introduced VFIO EEH operation
   will be implemented based on this function.
- Introduce debugfs /sys/kernel/debug/powerpc/PCI/errinjct, which


Here maybe  /sys/kernel/debug/powerpc/errinjct is better, because it 
will supply PCI_domain_nr in parameters, so no need supply errinjct 
for each PCI domain.


Another reason is error inject not only for PCI(in future), so better 
not in PCI domain entry.


Also it simple for userland tools to has a fixed path.

Thanks
Mike


   receives PCI error injection parameters from errinjct. It could
   have format: ei_token:addr:mask:PCI_domain_nr:PE_num:function.
   Eventually, eeh_err_inject() is invoked to call the corresponding
   OPAL API.

Thanks,
Gavin



--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/3] powerpc/powernv: Support PCI error injection

2014-06-24 Thread Benjamin Herrenschmidt
On Wed, 2014-06-25 at 11:05 +0800, Mike Qiu wrote:
 Here maybe  /sys/kernel/debug/powerpc/errinjct is better, because
 it 
 will supply PCI_domain_nr in parameters, so no need supply errinjct 
 for each PCI domain.
 
 Another reason is error inject not only for PCI(in future), so better 
 not in PCI domain entry.
 
 Also it simple for userland tools to has a fixed path.

I don't like this. I much prefer have dedicated error injection files
in their respective locations, something for PCI under the corresponding
PCI bridge etc...

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Bugfix: powerpc/eeh: Create eeh sysfs entry in post_init()

2014-06-24 Thread Mike Qiu
Eeh sysfs entry created must be after EEH_ENABLED been set
in eeh_subsystem_flags.

In PowerNV platform, it try to create sysfs entry before
EEH_ENABLED been set, when boot up. So nothing will be
created for eeh in sysfs.

Signed-off-by: Mike Qiu qiud...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/powernv/eeh-ioda.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c 
b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 8ad0c5b..5f95581 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -136,6 +136,9 @@ static int ioda_eeh_post_init(struct pci_controller *hose)
struct pnv_phb *phb = hose-private_data;
int ret;
 
+   /* Creat sysfs after EEH_ENABLED been set */
+   eeh_add_sysfs_files(hose-bus);
+
/* Register OPAL event notifier */
if (!ioda_eeh_nb_init) {
ret = opal_notifier_register(ioda_eeh_nb);
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Bugfix: powerpc/eeh: Create eeh sysfs entry in post_init()

2014-06-24 Thread Gavin Shan
On Tue, Jun 24, 2014 at 11:32:07PM -0400, Mike Qiu wrote:

[ cc Richard ]

Eeh sysfs entry created must be after EEH_ENABLED been set
in eeh_subsystem_flags.

In PowerNV platform, it try to create sysfs entry before
EEH_ENABLED been set, when boot up. So nothing will be
created for eeh in sysfs.


Could you please make the commit log more clear? :-)

I guess the issue is introduced by commit 2213fb1 (
powerpc/eeh: Skip eeh sysfs when eeh is disabled). The
commit checks EEH is enabled while creating PCI device
EEH sysfs files. If not, the sysfs files won't be created.
That's to avoid warning reported during PCI hotplug.

The problem you're reporting (if I understand completely):
You don't see the sysfs files after the system boots up.
If it's the case, you probably need following changes in
arch/powerpc/platforms/powernv/pci.c::pnv_pci_ioda_fixup().
Could you have a try with it?

#ifdef CONFIG_EEH
eeh_probe_mode_set(EEH_PROBE_MODE_DEV);
-   eeh_addr_cache_build();
eeh_init();
+   eeh_addr_cache_build();
#endif

Eventually PowerNV/pSeries have same function call sequence:

- Set EEH probe mode
- Doing probe (with device node or PCI device)
- Build address cache.
 
Signed-off-by: Mike Qiu qiud...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/powernv/eeh-ioda.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c 
b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 8ad0c5b..5f95581 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -136,6 +136,9 @@ static int ioda_eeh_post_init(struct pci_controller *hose)
   struct pnv_phb *phb = hose-private_data;
   int ret;

+  /* Creat sysfs after EEH_ENABLED been set */
+  eeh_add_sysfs_files(hose-bus);
+
   /* Register OPAL event notifier */
   if (!ioda_eeh_nb_init) {
   ret = opal_notifier_register(ioda_eeh_nb);

Thanks,
Gavin

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html