date:20140520

Re: [PATCH 3/8] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Gavin Shan

On Mon, May 19, 2014 at 06:37:24PM -0600, Alex Williamson wrote:
>On Tue, 2014-05-20 at 10:22 +1000, Gavin Shan wrote:
>> On Mon, May 19, 2014 at 04:33:10PM -0600, Alex Williamson wrote:
>> >On Wed, 2014-05-14 at 14:11 +1000, Gavin Shan wrote:
>> >> The patch adds new IOCTL command VFIO_EEH_INFO to VFIO container
>> >> to support EEH functionality for PCI devices, which have been
>> >> passed from host to guest via VFIO.
>> 
>> Thanks for your comments, Alex.W :-)
>> 
>> >
>> >Some comments throughout, but overall this seems to forgo every bit of
>> >the device ownership and protection model used by VFIO and lets the user
>> >pick arbitrary host devices and do various operations, mostly unchecked.
>> >That's not acceptable.
>> >
>> 
>> As what I replied to patch[2], I'm going to let VFIO-PCI-dev fd handle
>> the newly introduced IOCTL command. That way, we should follow the VFIO
>> design principles (ownership and protection) because VFIO-PCI-dev fd
>> is owned by QEMU process usually.
>> 
>> Also, the address mapping maintained in EEH will be removed.
>> 
>> >> Signed-off-by: Gavin Shan 
>> >> ---
>> >>  arch/powerpc/platforms/powernv/Makefile   |   1 +
>> >>  arch/powerpc/platforms/powernv/eeh-vfio.c | 593 
>> >> ++
>> >>  drivers/vfio/vfio_iommu_spapr_tce.c   |  12 +
>> >>  include/uapi/linux/vfio.h |  57 +++
>> >>  4 files changed, 663 insertions(+)
>> >>  create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
>> >> 
>> >> diff --git a/arch/powerpc/platforms/powernv/Makefile 
>> >> b/arch/powerpc/platforms/powernv/Makefile
>> >> index 63cebb9..2b15a03 100644
>> >> --- a/arch/powerpc/platforms/powernv/Makefile
>> >> +++ b/arch/powerpc/platforms/powernv/Makefile
>> >> @@ -6,5 +6,6 @@ obj-y += opal-msglog.o
>> >>  obj-$(CONFIG_SMP)+= smp.o
>> >>  obj-$(CONFIG_PCI)+= pci.o pci-p5ioc2.o pci-ioda.o
>> >>  obj-$(CONFIG_EEH)+= eeh-ioda.o eeh-powernv.o
>> >> +obj-$(CONFIG_VFIO_EEH)   += eeh-vfio.o
>> >>  obj-$(CONFIG_PPC_SCOM)   += opal-xscom.o
>> >>  obj-$(CONFIG_MEMORY_FAILURE) += opal-memory-errors.o
>> >> diff --git a/arch/powerpc/platforms/powernv/eeh-vfio.c 
>> >> b/arch/powerpc/platforms/powernv/eeh-vfio.c
>> >> new file mode 100644
>> >> index 000..69d5f2d
>> >> --- /dev/null
>> >> +++ b/arch/powerpc/platforms/powernv/eeh-vfio.c
>> >> @@ -0,0 +1,593 @@
>> >> +/*
>> >> +  * The file intends to support EEH funtionality for those PCI devices,
>> >> +  * which have been passed through from host to guest via VFIO. So this
>> >> +  * file is naturally part of VFIO implementation on PowerNV platform.
>> >> +  *
>> >> +  * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2014.
>> >> +  *
>> >> +  * This program is free software; you can redistribute it and/or modify
>> >> +  * it under the terms of the GNU General Public License as published by
>> >> +  * the Free Software Foundation; either version 2 of the License, or
>> >> +  * (at your option) any later version.
>> >> +  */
>> >> +
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +
>> >> +#include "powernv.h"
>> >> +#include "pci.h"
>> >> +
>> >> +static int powernv_eeh_vfio_map(struct vfio_eeh_info *info)
>> >> +{
>> >> + struct pci_bus *bus, *pe_bus;
>> >> + struct pci_dev *pdev;
>> >> + struct eeh_dev *edev;
>> >> + struct eeh_pe *pe;
>> >> + int domain, bus_no, devfn;
>> >> +
>> >> + /* Host address */
>> >> + domain = info->map.host_domain;
>> >> + bus_no = (info->map.host_cfg_addr >> 8) & 0xff;
>> >> + devfn = info->map.host_cfg_addr & 0xff;
>> >
>> >Where are we validating that the user has any legitimate claim to be
>> >touching this device?
>> >
>> 
>> I'll let VFIO-PCI-dev fd handle the IOCTL command. With that, we shouldn't
>> have the problem.
>> 
>> >> + /* Find PCI bus */
>> >> + bus = pci_find_bus(domain, bus_no);
>> >> + if (!bus) {
>> >> + pr_warn("%s: PCI bus %04x:%02x not found\n",
>> >> + __func__, domain, bus_no);
>> >> + return -ENODEV;
>> >> + }
>> >> +
>> >> + /* Find PCI device */
>> >> + pdev = pci_get_slot(bus, devfn);
>> >> + if (!pdev) {
>> >> + pr_warn("%s: PCI device %04x:%02x:%02x.%01x not found\n",
>> >> + __func__, domain, bus_no,
>> >> + PCI_SLOT(devfn), PCI_FUNC(devfn));
>> >> + return -ENODEV;
>> >> + }
>> >> +
>> >> + /* No EEH device - almost impossible */
>> >> + edev = pci_dev_to_eeh_dev(pdev);
>> >> + if (unlikely(!edev)) {
>> >> + pci_dev_put(pdev);
>> >> + pr_warn("%s: No EEH dev for PCI device %s\n",
>> >> + __func__, pci_name(pdev));
>> >> + return -ENODEV;
>> >> + }
>> >> +
>> >> + /* Doesn't

[PATCH 2/4] powerpc/eeh: Flags for passed device and PE

2014-05-20 Thread Gavin Shan

The patch introduces new flags for EEH device and PE to indicate
that the device or PE has been passed through to guest. In turn,
we will deliver EEH errors to guest for further handling, which
will be done in subsequent patches.

Signed-off-by: Gavin Shan 
---
 arch/powerpc/include/asm/eeh.h | 32 
 1 file changed, 32 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 7782056..34a2d83 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -72,6 +72,7 @@ struct device_node;
 #define EEH_PE_RESET   (1 << 2)/* PE reset in progress */
 
 #define EEH_PE_KEEP(1 << 8)/* Keep PE on hotplug   */
+#define EEH_PE_PASSTHROUGH (1 << 9)/* PE owned by guest*/
 
 struct eeh_pe {
int type;   /* PE type: PHB/Bus/Device  */
@@ -93,6 +94,21 @@ struct eeh_pe {
 #define eeh_pe_for_each_dev(pe, edev, tmp) \
list_for_each_entry_safe(edev, tmp, &pe->edevs, list)
 
+static inline bool eeh_pe_passed(struct eeh_pe *pe)
+{
+   return pe ? !!(pe->state & EEH_PE_PASSTHROUGH) : false;
+}
+
+static inline void eeh_pe_set_passed(struct eeh_pe *pe, bool passed)
+{
+   if (pe) {
+   if (passed)
+   pe->state |= EEH_PE_PASSTHROUGH;
+   else
+   pe->state &= ~EEH_PE_PASSTHROUGH;
+   }
+}
+
 /*
  * The struct is used to trace EEH state for the associated
  * PCI device node or PCI device. In future, it might
@@ -110,6 +126,7 @@ struct eeh_pe {
 #define EEH_DEV_SYSFS  (1 << 9)/* Sysfs created*/
 #define EEH_DEV_REMOVED(1 << 10)   /* Removed permanently  
*/
 #define EEH_DEV_FRESET (1 << 11)   /* Fundamental reset*/
+#define EEH_DEV_PASSTHROUGH(1 << 12)   /* Owned by guest   */
 
 struct eeh_dev {
int mode;   /* EEH mode */
@@ -138,6 +155,21 @@ static inline struct pci_dev *eeh_dev_to_pci_dev(struct 
eeh_dev *edev)
return edev ? edev->pdev : NULL;
 }
 
+static inline bool eeh_dev_passed(struct eeh_dev *dev)
+{
+   return dev ? !!(dev->mode & EEH_DEV_PASSTHROUGH) : false;
+}
+
+static inline void eeh_dev_set_passed(struct eeh_dev *dev, bool passed)
+{
+   if (dev) {
+   if (passed)
+   dev->mode |= EEH_DEV_PASSTHROUGH;
+   else
+   dev->mode &= ~EEH_DEV_PASSTHROUGH;
+   }
+}
+
 /* Return values from eeh_ops::next_error */
 enum {
EEH_NEXT_ERR_NONE = 0,
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/4] drivers/vfio: Introduce CONFIG_VFIO_PCI_EEH

2014-05-20 Thread Gavin Shan

The patch introduces CONFIG_VFIO_PCI_EEH for more IOCTL commands
on VFIO PCI device support EEH funtionality for PCI devices that
are passed through from host to guest.

Signed-off-by: Gavin Shan 
---
 drivers/vfio/pci/Kconfig | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index c41b01e..dd0e0f0 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,7 @@
 config VFIO_PCI
tristate "VFIO support for PCI devices"
depends on VFIO && PCI && EVENTFD
+   select VFIO_PCI_EEH if PPC_POWERNV
help
  Support for the PCI VFIO bus driver.  This is required to make
  use of PCI drivers using the VFIO framework.
@@ -16,3 +17,8 @@ config VFIO_PCI_VGA
  BIOS and generic video drivers.
 
  If you don't know what to do here, say N.
+
+config VFIO_PCI_EEH
+   tristate
+   depends on EEH && VFIO_IOMMU_SPAPR_TCE
+   default n
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Gavin Shan

The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
to support EEH functionality for PCI devices, which have been
passed from host to guest via VFIO.

Signed-off-by: Gavin Shan 
---
 arch/powerpc/platforms/powernv/Makefile   |   1 +
 arch/powerpc/platforms/powernv/eeh-vfio.c | 445 ++
 drivers/vfio/pci/vfio_pci.c   |  24 +-
 drivers/vfio/pci/vfio_pci_private.h   |  16 ++
 include/uapi/linux/vfio.h |  43 +++
 5 files changed, 523 insertions(+), 6 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c

diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index 63cebb9..45cd833 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -6,5 +6,6 @@ obj-y   += opal-msglog.o
 obj-$(CONFIG_SMP)  += smp.o
 obj-$(CONFIG_PCI)  += pci.o pci-p5ioc2.o pci-ioda.o
 obj-$(CONFIG_EEH)  += eeh-ioda.o eeh-powernv.o
+obj-$(CONFIG_VFIO_PCI_EEH) += eeh-vfio.o
 obj-$(CONFIG_PPC_SCOM) += opal-xscom.o
 obj-$(CONFIG_MEMORY_FAILURE)   += opal-memory-errors.o
diff --git a/arch/powerpc/platforms/powernv/eeh-vfio.c 
b/arch/powerpc/platforms/powernv/eeh-vfio.c
new file mode 100644
index 000..11adc55
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/eeh-vfio.c
@@ -0,0 +1,445 @@
+/*
+  * The file intends to support EEH funtionality for those PCI devices,
+  * which have been passed through from host to guest via VFIO. So this
+  * file is naturally part of VFIO implementation on PowerNV platform.
+  *
+  * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2014.
+  *
+  * This program is free software; you can redistribute it and/or modify
+  * it under the terms of the GNU General Public License as published by
+  * the Free Software Foundation; either version 2 of the License, or
+  * (at your option) any later version.
+  */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "powernv.h"
+#include "pci.h"
+
+static int powernv_eeh_vfio_check_dev(struct pci_dev *pdev,
+ struct eeh_dev **pedev,
+ struct eeh_pe **ppe,
+ struct pnv_phb **pphb)
+{
+   struct eeh_dev *edev;
+   struct pnv_phb *phb;
+
+   /* No device ? */
+   if (!pdev)
+   return -ENODEV;
+
+   edev = pci_dev_to_eeh_dev(pdev);
+   if (!edev || !eeh_dev_passed(edev) ||
+   !edev->pe || !eeh_pe_passed(edev->pe))
+   return -ENODEV;
+
+   /* EEH isn't supported ? */
+   phb = edev->phb->private_data;
+   if (!(phb->flags & PNV_PHB_FLAG_EEH))
+   return -EACCES;
+
+   if (pedev)
+   *pedev = edev;
+   if (ppe)
+   *ppe = edev->pe;
+   if (pphb)
+   *pphb = phb;
+
+   return 0;
+}
+
+static int powernv_eeh_vfio_set_option(struct pci_dev *pdev,
+  struct vfio_eeh_op *info)
+{
+   struct eeh_dev *edev;
+   struct eeh_pe *pe;
+   struct pnv_phb *phb;
+   int opcode = info->option.option;
+   int ret = 0;
+
+   /* Device existing ? */
+   ret = powernv_eeh_vfio_check_dev(pdev, &edev, &pe, &phb);
+   if (ret) {
+   pr_debug("%s: Cannot find device\n",
+   __func__);
+   info->option.ret = -7;
+   goto out;
+   }
+
+   /* Invalid opcode ? */
+   if (opcode < EEH_OPT_DISABLE ||
+   opcode > EEH_OPT_THAW_DMA) {
+   pr_debug("%s: Opcode#%d out of range (%d, %d)\n",
+__func__, opcode, EEH_OPT_DISABLE, EEH_OPT_THAW_DMA);
+   info->option.ret = -3;
+   ret = -EINVAL;
+   goto out;
+   }
+
+   if (opcode == EEH_OPT_DISABLE ||
+   opcode == EEH_OPT_ENABLE) {
+   info->option.ret = 0;
+   } else {
+   if (!phb->eeh_ops || !phb->eeh_ops->set_option) {
+   info->option.ret = -7;
+   ret = -ENOENT;
+   goto out;
+   }
+
+   ret = phb->eeh_ops->set_option(pe, opcode);
+   if (ret) {
+   pr_debug("%s: Failure %d from backend\n",
+   __func__, ret);
+   info->option.ret = -3;
+   goto out;
+   }
+
+   info->option.ret = 0;
+   }
+out:
+   return ret;
+}
+
+static int powernv_eeh_vfio_get_addr(struct pci_dev *pdev,
+struct vfio_eeh_op *info)
+{
+   struct pci_bus *bus;
+   struct eeh_dev *edev;
+   struct eeh_pe *pe;
+   struct pnv_phb *phb;
+   int o

[PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Gavin Shan

If we detects frozen state on PE that has been passed to guest, we
needn't handle it. Instead, we rely on the guest to detect and recover
it. The patch avoid EEH event on the frozen passed PE so that the guest
can have chance to handle that.

Signed-off-by: Gavin Shan 
---
 arch/powerpc/kernel/eeh.c | 8 
 arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 9c6b899..6543f05 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
if (ret > 0)
return ret;
 
+   /*
+* If the PE has been passed to guest, we won't check the
+* state. Instead, let the guest handle it if the PE has
+* been frozen.
+*/
+   if (eeh_pe_passed(pe))
+   return 0;
+
/* If we already have a pending isolation event for this
 * slot, we know it's bad already, we don't need to check.
 * Do this checking under a lock; as multiple PCI devices
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c 
b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 1b5982f..03a3ed2 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -890,7 +890,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
opal_pci_eeh_freeze_clear(phb->opal_id, 
frozen_pe_no,
OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
ret = EEH_NEXT_ERR_NONE;
-   } else if ((*pe)->state & EEH_PE_ISOLATED) {
+   } else if ((*pe)->state & EEH_PE_ISOLATED ||
+  eeh_pe_passed(*pe)) {
ret = EEH_NEXT_ERR_NONE;
} else {
pr_err("EEH: Frozen PHB#%x-PE#%x (%s) 
detected\n",
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFCv4 0/4] EEH Support for VFIO PCI device

2014-05-20 Thread Gavin Shan

The series of patches intends to support EEH for PCI devices, which are
passed through to PowerKVM based guest via VFIO. The implementation is
straightforward based on the issues or problems we have to resolve to
support EEH for PowerKVM based guest.

- Emulation for EEH RTAS requests. All EEH RTAS requests goes to QEMU firstly.
  If QEMU can't handle it, the request will be sent to host via newly introduced
  VFIO container IOCTL command (VFIO_EEH_OP) and gets handled in host kernel.

The series of patches requires corresponding QEMU changes.

Change log
==
v1 -> v2:
* EEH RTAS requests are routed to QEMU, and then possiblly to host 
kerenl.
  The mechanism KVM in-kernel handling is dropped.
* Error injection is reimplemented based syscall, instead of KVM 
in-kerenl
  handling. The logic for error injection token management is moved to
  QEMU. The error injection request is routed to QEMU and then possiblly
  to host kernel.
v2 -> v3:
* Make the fields in struct eeh_vfio_pci_addr, struct vfio_eeh_info 
based
  on the comments from Alexey.
* Define macros for EEH VFIO operations (Alexey).
* Clear frozen state after successful PE reset.
* Merge original [PATCH 1/2/3] to one.
v3 -> v4:
* Remove the error injection from the patchset. Mike or I will work on 
that
  later.
* Rename CONFIG_VFIO_EEH to VFIO_PCI_EEH.
* Rename the IOCTL command to VFIO_EEH_OP and it's handled by VFIO-PCI 
device
  instead of VFIO container.
* Rename the IOCTL argument structure to "vfio_eeh_op" accordingly. 
Also, more
  fields added to hold return values for RTAS requests.
* The address mapping stuff is totally removed. When opening or 
releasing VFIO
  PCI device, notification sent to EEH to update the flags indicates 
the device
  is passed to guest or not.
* Change pr_warn() to pr_debug() to avoid DOS as pointed by Alex.W
* Argument size check issue pointed by Alex.W.

Testing on P7
=

- Emulex adapter

Testing on P8
=
Need testing for more after the code is finalized. The logic is required by 
P7/P8
machines shouldn't be different.

Gavin Shan (4):
  drivers/vfio: Introduce CONFIG_VFIO_PCI_EEH
  powerpc/eeh: Flags for passed device and PE
  drivers/vfio: New IOCTL command VFIO_EEH_INFO
  powerpc/eeh: Avoid event on passed PE

 arch/powerpc/include/asm/eeh.h|  32 +++
 arch/powerpc/kernel/eeh.c |   8 +
 arch/powerpc/platforms/powernv/Makefile   |   1 +
 arch/powerpc/platforms/powernv/eeh-ioda.c |   3 +-
 arch/powerpc/platforms/powernv/eeh-vfio.c | 445 ++
 drivers/vfio/pci/Kconfig  |   6 +
 drivers/vfio/pci/vfio_pci.c   |  24 +-
 drivers/vfio/pci/vfio_pci_private.h   |  16 ++
 include/uapi/linux/vfio.h |  43 +++
 9 files changed, 571 insertions(+), 7 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c

-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/6] KVM: PPC: Book3S PR: Expose TM registers

2014-05-20 Thread Paul Mackerras

On Mon, May 19, 2014 at 03:09:07PM +0200, Alexander Graf wrote:
> 
> On 17.05.14 08:20, Paul Mackerras wrote:
> >On Tue, Apr 29, 2014 at 06:17:42PM +0200, Alexander Graf wrote:
> >>POWER8 introduces transactional memory which brings along a number of new
> >>registers and MSR bits.
> >>
> >>Implementing all of those is a pretty big headache, so for now let's at 
> >>least
> >>emulate enough to make Linux's context switching code happy.
> >[snip]
> >
> >>-   if (!(vcpu->arch.fscr & (1ULL << fac))) {
> >>+   /* We get TM interrupts only when EBB is disabled? Sigh. */
> >This comment doesn't make sense to me.  Not every reason code reported
> >in the high bits of FSCR corresponds directly to an enable bit in
> >FSCR.  In fact, of the 7 defined reason codes in POWER8, only three
> >correspond to an enable bit...
> 
> Is there any documentation on which relate to what?

Yes, Power ISA v2.07 book 3S section 6.2.10 describes the FSCR enable
bits and the interruption cause field.  There are 6 cause values
defined, of which 3 correspond to enable bits in the FSCR, and the
other 3 correspond to things enabled/disabled in MMCR0 (usermode PMC
anb BHRB access) or MSR (TM stuff).

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/8] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Alexander Graf



On 20.05.14 10:28, Gavin Shan wrote:

On Mon, May 19, 2014 at 06:37:24PM -0600, Alex Williamson wrote:

On Tue, 2014-05-20 at 10:22 +1000, Gavin Shan wrote:

On Mon, May 19, 2014 at 04:33:10PM -0600, Alex Williamson wrote:

On Wed, 2014-05-14 at 14:11 +1000, Gavin Shan wrote:

The patch adds new IOCTL command VFIO_EEH_INFO to VFIO container
to support EEH functionality for PCI devices, which have been
passed from host to guest via VFIO.


[...]


diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..1fd1bfb 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -455,6 +455,63 @@ struct vfio_iommu_spapr_tce_info {
  
  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
  
+/*

+ * The VFIO EEH info struct provides way to support EEH functionality
+ * for PCI device that is passed from host to guest via VFIO.
+ */
+#define VFIO_EEH_OP_MAP0
+#define VFIO_EEH_OP_UNMAP  1
+#define VFIO_EEH_OP_SET_OPTION 2
+#define VFIO_EEH_OP_GET_ADDR   3
+#define VFIO_EEH_OP_GET_STATE  4
+#define VFIO_EEH_OP_PE_RESET   5
+#define VFIO_EEH_OP_PE_CONFIG  6

Is this really an "info" ioctl?


Yeah, "VFIO_EEH_INFO" isn't a good name. How about to have "VFIO_EEH_HANDLER" ?

VFIO_EEH_OP perhaps.  Thanks,


Ok. Will rename it to VFIO_EEH_OP in next revision.


Is there any benefit of a multiplexing EEH ioctl over just 7 individual 
ioctls?



Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/8] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Gavin Shan

On Tue, May 20, 2014 at 12:02:22PM +0200, Alexander Graf wrote:
>
>On 20.05.14 10:28, Gavin Shan wrote:
>>On Mon, May 19, 2014 at 06:37:24PM -0600, Alex Williamson wrote:
>>>On Tue, 2014-05-20 at 10:22 +1000, Gavin Shan wrote:
On Mon, May 19, 2014 at 04:33:10PM -0600, Alex Williamson wrote:
>On Wed, 2014-05-14 at 14:11 +1000, Gavin Shan wrote:
>>The patch adds new IOCTL command VFIO_EEH_INFO to VFIO container
>>to support EEH functionality for PCI devices, which have been
>>passed from host to guest via VFIO.
>
>[...]
>
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..1fd1bfb 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -455,6 +455,63 @@ struct vfio_iommu_spapr_tce_info {
  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+/*
+ * The VFIO EEH info struct provides way to support EEH functionality
+ * for PCI device that is passed from host to guest via VFIO.
+ */
+#define VFIO_EEH_OP_MAP0
+#define VFIO_EEH_OP_UNMAP  1
+#define VFIO_EEH_OP_SET_OPTION 2
+#define VFIO_EEH_OP_GET_ADDR   3
+#define VFIO_EEH_OP_GET_STATE  4
+#define VFIO_EEH_OP_PE_RESET   5
+#define VFIO_EEH_OP_PE_CONFIG  6
>Is this really an "info" ioctl?
>
Yeah, "VFIO_EEH_INFO" isn't a good name. How about to have 
"VFIO_EEH_HANDLER" ?
>>>VFIO_EEH_OP perhaps.  Thanks,
>>>
>>Ok. Will rename it to VFIO_EEH_OP in next revision.
>
>Is there any benefit of a multiplexing EEH ioctl over just 7
>individual ioctls?
>

One benefit is pass one data struct to the real handler implemented
in arch/powerpc/platforms/powernv/eeh-vfio.c. With 7 commands, we
either passing "void *arg" + "command" to the handler, or export
7 functions with EXPORT_SYMBOL_GPL().

I just send RFCv4 out. VFIO_EEH_OP_{MAP, UNMAP} is removed there.

Thanks,
Gavin

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Alexander Graf



On 20.05.14 10:30, Gavin Shan wrote:

The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
to support EEH functionality for PCI devices, which have been
passed from host to guest via VFIO.

Signed-off-by: Gavin Shan 
---
  arch/powerpc/platforms/powernv/Makefile   |   1 +
  arch/powerpc/platforms/powernv/eeh-vfio.c | 445 ++
  drivers/vfio/pci/vfio_pci.c   |  24 +-
  drivers/vfio/pci/vfio_pci_private.h   |  16 ++
  include/uapi/linux/vfio.h |  43 +++
  5 files changed, 523 insertions(+), 6 deletions(-)
  create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c


Why doesn't this code live inside the vfio module? If I don't load the 
vfio module, I don't need that code to waste memory in my kernel, no?



Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Alexander Graf



On 20.05.14 10:30, Gavin Shan wrote:

If we detects frozen state on PE that has been passed to guest, we
needn't handle it. Instead, we rely on the guest to detect and recover
it. The patch avoid EEH event on the frozen passed PE so that the guest
can have chance to handle that.

Signed-off-by: Gavin Shan 


How does the guest learn about this failure? We'd need to inject an 
error into it, no?


I think what you want is an irqfd that the in-kernel eeh code notifies 
when it sees a failure. When such an fd exists, the kernel skips its own 
error handling.



Alex


---
  arch/powerpc/kernel/eeh.c | 8 
  arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
  2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 9c6b899..6543f05 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
if (ret > 0)
return ret;
  
+	/*

+* If the PE has been passed to guest, we won't check the
+* state. Instead, let the guest handle it if the PE has
+* been frozen.
+*/
+   if (eeh_pe_passed(pe))
+   return 0;
+
/* If we already have a pending isolation event for this
 * slot, we know it's bad already, we don't need to check.
 * Do this checking under a lock; as multiple PCI devices
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c 
b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 1b5982f..03a3ed2 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -890,7 +890,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
opal_pci_eeh_freeze_clear(phb->opal_id, 
frozen_pe_no,
OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
ret = EEH_NEXT_ERR_NONE;
-   } else if ((*pe)->state & EEH_PE_ISOLATED) {
+   } else if ((*pe)->state & EEH_PE_ISOLATED ||
+  eeh_pe_passed(*pe)) {
ret = EEH_NEXT_ERR_NONE;
} else {
pr_err("EEH: Frozen PHB#%x-PE#%x (%s) 
detected\n",


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Alexander Graf



On 20.05.14 13:21, Alexander Graf wrote:


On 20.05.14 10:30, Gavin Shan wrote:

The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
to support EEH functionality for PCI devices, which have been
passed from host to guest via VFIO.

Signed-off-by: Gavin Shan 
---
  arch/powerpc/platforms/powernv/Makefile   |   1 +
  arch/powerpc/platforms/powernv/eeh-vfio.c | 445 
++

  drivers/vfio/pci/vfio_pci.c   |  24 +-
  drivers/vfio/pci/vfio_pci_private.h   |  16 ++
  include/uapi/linux/vfio.h |  43 +++
  5 files changed, 523 insertions(+), 6 deletions(-)
  create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c


Why doesn't this code live inside the vfio module? If I don't load the 
vfio module, I don't need that code to waste memory in my kernel, no?


So I think from a modeling point of view, you want VFIO code that calls 
reasonably generic helpers inside the kernel to deal with errors.


The "generic helpers" don't have anything to do with VFIO. Everything 
that interfaces via ioctls with user space is 100% VFIO code.


The latter should be tristate inside vfio.ko, the former can be =y.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Gavin Shan

On Tue, May 20, 2014 at 01:28:40PM +0200, Alexander Graf wrote:
>
>On 20.05.14 13:21, Alexander Graf wrote:
>>
>>On 20.05.14 10:30, Gavin Shan wrote:
>>>The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
>>>to support EEH functionality for PCI devices, which have been
>>>passed from host to guest via VFIO.
>>>
>>>Signed-off-by: Gavin Shan 
>>>---
>>>  arch/powerpc/platforms/powernv/Makefile   |   1 +
>>>  arch/powerpc/platforms/powernv/eeh-vfio.c | 445
>>>++
>>>  drivers/vfio/pci/vfio_pci.c   |  24 +-
>>>  drivers/vfio/pci/vfio_pci_private.h   |  16 ++
>>>  include/uapi/linux/vfio.h |  43 +++
>>>  5 files changed, 523 insertions(+), 6 deletions(-)
>>>  create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
>>
>>Why doesn't this code live inside the vfio module? If I don't load
>>the vfio module, I don't need that code to waste memory in my
>>kernel, no?

Yes, It saves some memory.

>
>So I think from a modeling point of view, you want VFIO code that
>calls reasonably generic helpers inside the kernel to deal with
>errors.
>
>The "generic helpers" don't have anything to do with VFIO. Everything
>that interfaces via ioctls with user space is 100% VFIO code.
>
>The latter should be tristate inside vfio.ko, the former can be =y.
>

The main reason I put eeh-vfio.c to arch/powerpc/platforms/powernv/ is
the source file needs access data structures (struct pnv_phb) defined
in "pci.h" under that directory.

Thanks,
Gavin

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Alexander Graf



On 20.05.14 13:40, Gavin Shan wrote:

On Tue, May 20, 2014 at 01:28:40PM +0200, Alexander Graf wrote:

On 20.05.14 13:21, Alexander Graf wrote:

On 20.05.14 10:30, Gavin Shan wrote:

The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
to support EEH functionality for PCI devices, which have been
passed from host to guest via VFIO.

Signed-off-by: Gavin Shan 
---
  arch/powerpc/platforms/powernv/Makefile   |   1 +
  arch/powerpc/platforms/powernv/eeh-vfio.c | 445
++
  drivers/vfio/pci/vfio_pci.c   |  24 +-
  drivers/vfio/pci/vfio_pci_private.h   |  16 ++
  include/uapi/linux/vfio.h |  43 +++
  5 files changed, 523 insertions(+), 6 deletions(-)
  create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c

Why doesn't this code live inside the vfio module? If I don't load
the vfio module, I don't need that code to waste memory in my
kernel, no?

Yes, It saves some memory.


So I think from a modeling point of view, you want VFIO code that
calls reasonably generic helpers inside the kernel to deal with
errors.

The "generic helpers" don't have anything to do with VFIO. Everything
that interfaces via ioctls with user space is 100% VFIO code.

The latter should be tristate inside vfio.ko, the former can be =y.


The main reason I put eeh-vfio.c to arch/powerpc/platforms/powernv/ is
the source file needs access data structures (struct pnv_phb) defined
in "pci.h" under that directory.


Then create a good in-kernel framework from that directory and make use 
of it from the VFIO code :). But please don't mesh together VFIO, 
powernv EEH handling and RTAS.



Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/6] KVM: PPC: Book3S PR: Expose TM registers

2014-05-20 Thread Alexander Graf


On 20.05.2014, at 11:59, Paul Mackerras  wrote:

> On Mon, May 19, 2014 at 03:09:07PM +0200, Alexander Graf wrote:
>> 
>> On 17.05.14 08:20, Paul Mackerras wrote:
>>> On Tue, Apr 29, 2014 at 06:17:42PM +0200, Alexander Graf wrote:
 POWER8 introduces transactional memory which brings along a number of new
 registers and MSR bits.
 
 Implementing all of those is a pretty big headache, so for now let's at 
 least
 emulate enough to make Linux's context switching code happy.
>>> [snip]
>>> 
 -  if (!(vcpu->arch.fscr & (1ULL << fac))) {
 +  /* We get TM interrupts only when EBB is disabled? Sigh. */
>>> This comment doesn't make sense to me.  Not every reason code reported
>>> in the high bits of FSCR corresponds directly to an enable bit in
>>> FSCR.  In fact, of the 7 defined reason codes in POWER8, only three
>>> correspond to an enable bit...
>> 
>> Is there any documentation on which relate to what?
> 
> Yes, Power ISA v2.07 book 3S section 6.2.10 describes the FSCR enable
> bits and the interruption cause field.  There are 6 cause values
> defined, of which 3 correspond to enable bits in the FSCR, and the
> other 3 correspond to things enabled/disabled in MMCR0 (usermode PMC
> anb BHRB access) or MSR (TM stuff).

I see. How's this?

Alex

commit a8e53f5f5e6c5d99363ad0d695a9ee520e1d262d
Author: Alexander Graf 
Date:   Tue Apr 29 17:54:40 2014 +0200

KVM: PPC: Book3S PR: Expose TM registers

POWER8 introduces transactional memory which brings along a number of new
registers and MSR bits.

Implementing all of those is a pretty big headache, so for now let's at 
least
emulate enough to make Linux's context switching code happy.

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - move to book3s_64 only section
  - restrict to CONFIG_PPC_TRANSACTIONAL_MEM

v2 -> v3:

  - check MSR.TM for TM enablement inside the guest

diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index e1165ba..9bdff15 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -451,6 +451,17 @@ int kvmppc_core_emulate_mtspr_pr(struct kvm_vcpu *vcpu, 
int sprn, ulong spr_val)
case SPRN_EBBRR:
vcpu->arch.ebbrr = spr_val;
break;
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   case SPRN_TFHAR:
+   vcpu->arch.tfhar = spr_val;
+   break;
+   case SPRN_TEXASR:
+   vcpu->arch.texasr = spr_val;
+   break;
+   case SPRN_TFIAR:
+   vcpu->arch.tfiar = spr_val;
+   break;
+#endif
 #endif
case SPRN_ICTC:
case SPRN_THRM1:
@@ -572,6 +583,17 @@ int kvmppc_core_emulate_mfspr_pr(struct kvm_vcpu *vcpu, 
int sprn, ulong *spr_val
case SPRN_EBBRR:
*spr_val = vcpu->arch.ebbrr;
break;
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   case SPRN_TFHAR:
+   *spr_val = vcpu->arch.tfhar;
+   break;
+   case SPRN_TEXASR:
+   *spr_val = vcpu->arch.texasr;
+   break;
+   case SPRN_TFIAR:
+   *spr_val = vcpu->arch.tfiar;
+   break;
+#endif
 #endif
case SPRN_THRM1:
case SPRN_THRM2:
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 7d27a95..23367a7 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -794,9 +794,27 @@ static void kvmppc_emulate_fac(struct kvm_vcpu *vcpu, 
ulong fac)
 /* Enable facilities (TAR, EBB, DSCR) for the guest */
 static int kvmppc_handle_fac(struct kvm_vcpu *vcpu, ulong fac)
 {
+   bool guest_fac_enabled;
BUG_ON(!cpu_has_feature(CPU_FTR_ARCH_207S));

-   if (!(vcpu->arch.fscr & (1ULL << fac))) {
+   /*
+* Not every facility is enabled by FSCR bits, check whether the
+* guest has this facility enabled at all.
+*/
+   switch (fac) {
+   case FSCR_TAR_LG:
+   case FSCR_EBB_LG:
+   guest_fac_enabled = (vcpu->arch.fscr & (1ULL << fac));
+   break;
+   case FSCR_TM_LG:
+   guest_fac_enabled = kvmppc_get_msr(vcpu) & MSR_TM;
+   break;
+   default:
+   guest_fac_enabled = false;
+   break;
+   }
+
+   if (!guest_fac_enabled) {
/* Facility not enabled by the guest */
kvmppc_trigger_fac_interrupt(vcpu, fac);
return RESUME_GUEST;--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Gavin Shan

On Tue, May 20, 2014 at 01:25:11PM +0200, Alexander Graf wrote:
>
>On 20.05.14 10:30, Gavin Shan wrote:
>>If we detects frozen state on PE that has been passed to guest, we
>>needn't handle it. Instead, we rely on the guest to detect and recover
>>it. The patch avoid EEH event on the frozen passed PE so that the guest
>>can have chance to handle that.
>>
>>Signed-off-by: Gavin Shan 
>
>How does the guest learn about this failure? We'd need to inject an
>error into it, no?
>

When error is existing in HW level, 0xFF's will be turned on reading
PCI config space or memory BARs. Guest retrieves the failure state,
which is captured by HW automatically, via RTAS call
"ibm,read-slot-reset-state2" when seeing 0xFF's on reading PCI config
space or memory BARs. If "ibm,read-slot-reset-state2" reports errors in HW,
the guest kernel starts to recovery.

It can be called as "passive" reporting. There possible has one case that
the error can't be reported for ever: No device driver binding to the VFIO
PCI device and no access to device's config space and memory BARs. However,
it doesn't matter. As we don't use the device, we needn't detect and recover
the error at all.

>I think what you want is an irqfd that the in-kernel eeh code
>notifies when it sees a failure. When such an fd exists, the kernel
>skips its own error handling.
>

Yeah, it's a good idea and something for me to improve in phase II. We
can discuss for more later. For now, what I have in my head is something
like this:

  [ Host ] -> Error detected -> irqfd (or eventfd) -> QEMU 
   |
   -(A)-
   |
Send one EEH event to guest kernel
   |
Guest kernel starts the recovery

(A): I didn't figure out one convienent way to do the EEH event injection yet.

Thanks,
Gavin

>>---
>>  arch/powerpc/kernel/eeh.c | 8 
>>  arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
>>  2 files changed, 10 insertions(+), 1 deletion(-)
>>
>>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>>index 9c6b899..6543f05 100644
>>--- a/arch/powerpc/kernel/eeh.c
>>+++ b/arch/powerpc/kernel/eeh.c
>>@@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>>  if (ret > 0)
>>  return ret;
>>+ /*
>>+  * If the PE has been passed to guest, we won't check the
>>+  * state. Instead, let the guest handle it if the PE has
>>+  * been frozen.
>>+  */
>>+ if (eeh_pe_passed(pe))
>>+ return 0;
>>+
>>  /* If we already have a pending isolation event for this
>>   * slot, we know it's bad already, we don't need to check.
>>   * Do this checking under a lock; as multiple PCI devices
>>diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c 
>>b/arch/powerpc/platforms/powernv/eeh-ioda.c
>>index 1b5982f..03a3ed2 100644
>>--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
>>+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
>>@@ -890,7 +890,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
>>  opal_pci_eeh_freeze_clear(phb->opal_id, 
>> frozen_pe_no,
>>  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>  ret = EEH_NEXT_ERR_NONE;
>>- } else if ((*pe)->state & EEH_PE_ISOLATED) {
>>+ } else if ((*pe)->state & EEH_PE_ISOLATED ||
>>+eeh_pe_passed(*pe)) {
>>  ret = EEH_NEXT_ERR_NONE;
>>  } else {
>>  pr_err("EEH: Frozen PHB#%x-PE#%x (%s) 
>> detected\n",
>

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Alexander Graf



On 20.05.14 13:56, Gavin Shan wrote:

On Tue, May 20, 2014 at 01:25:11PM +0200, Alexander Graf wrote:

On 20.05.14 10:30, Gavin Shan wrote:

If we detects frozen state on PE that has been passed to guest, we
needn't handle it. Instead, we rely on the guest to detect and recover
it. The patch avoid EEH event on the frozen passed PE so that the guest
can have chance to handle that.

Signed-off-by: Gavin Shan 

How does the guest learn about this failure? We'd need to inject an
error into it, no?


When error is existing in HW level, 0xFF's will be turned on reading
PCI config space or memory BARs. Guest retrieves the failure state,
which is captured by HW automatically, via RTAS call
"ibm,read-slot-reset-state2" when seeing 0xFF's on reading PCI config
space or memory BARs. If "ibm,read-slot-reset-state2" reports errors in HW,
the guest kernel starts to recovery.

It can be called as "passive" reporting. There possible has one case that
the error can't be reported for ever: No device driver binding to the VFIO
PCI device and no access to device's config space and memory BARs. However,
it doesn't matter. As we don't use the device, we needn't detect and recover
the error at all.


So if the guest is waiting for an interrupt to happen it will wait 
forever? Not really nice.



I think what you want is an irqfd that the in-kernel eeh code
notifies when it sees a failure. When such an fd exists, the kernel
skips its own error handling.


Yeah, it's a good idea and something for me to improve in phase II. We
can discuss for more later.


I think it makes sense to at least walk into that direction immediately. 
The reason I brought it up in the context of this patch is that with an 
irqfd you wouldn't need the passed flag at all.



  For now, what I have in my head is something
like this:

   [ Host ] -> Error detected -> irqfd (or eventfd) -> QEMU
|
-(A)-
|
 Send one EEH event to guest kernel
|
 Guest kernel starts the recovery

(A): I didn't figure out one convienent way to do the EEH event injection yet.


How does the guest learn about errors in pHyp?


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Gavin Shan

On Tue, May 20, 2014 at 01:44:21PM +0200, Alexander Graf wrote:
>
>On 20.05.14 13:40, Gavin Shan wrote:
>>On Tue, May 20, 2014 at 01:28:40PM +0200, Alexander Graf wrote:
>>>On 20.05.14 13:21, Alexander Graf wrote:
On 20.05.14 10:30, Gavin Shan wrote:
>The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
>to support EEH functionality for PCI devices, which have been
>passed from host to guest via VFIO.
>
>Signed-off-by: Gavin Shan 
>---
>  arch/powerpc/platforms/powernv/Makefile   |   1 +
>  arch/powerpc/platforms/powernv/eeh-vfio.c | 445
>++
>  drivers/vfio/pci/vfio_pci.c   |  24 +-
>  drivers/vfio/pci/vfio_pci_private.h   |  16 ++
>  include/uapi/linux/vfio.h |  43 +++
>  5 files changed, 523 insertions(+), 6 deletions(-)
>  create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
Why doesn't this code live inside the vfio module? If I don't load
the vfio module, I don't need that code to waste memory in my
kernel, no?
>>Yes, It saves some memory.
>>
>>>So I think from a modeling point of view, you want VFIO code that
>>>calls reasonably generic helpers inside the kernel to deal with
>>>errors.
>>>
>>>The "generic helpers" don't have anything to do with VFIO. Everything
>>>that interfaces via ioctls with user space is 100% VFIO code.
>>>
>>>The latter should be tristate inside vfio.ko, the former can be =y.
>>>
>>The main reason I put eeh-vfio.c to arch/powerpc/platforms/powernv/ is
>>the source file needs access data structures (struct pnv_phb) defined
>>in "pci.h" under that directory.
>
>Then create a good in-kernel framework from that directory and make
>use of it from the VFIO code :). But please don't mesh together VFIO,
>powernv EEH handling and RTAS.
>

Yeah. How about this? :-)

- Move eeh-vfio.c to drivers/vfio/pci/
- From eeh-vfio.c, dereference arch/powerpc/kernel/eeh.c::eeh_ops, which
  is arch/powerpc/plaforms/powernv/eeh-powernv.c::powernv_eeh_ops. Call
  to the corresponding callbacks in "eeh_ops" based on incoming RTAS request.

  The file would be renamed to "vfio_eeh.c" as well after moving to VFIO
  driver directory.

Thanks,
Gavin

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Alexander Graf



On 20.05.14 14:21, Gavin Shan wrote:

On Tue, May 20, 2014 at 01:44:21PM +0200, Alexander Graf wrote:

On 20.05.14 13:40, Gavin Shan wrote:

On Tue, May 20, 2014 at 01:28:40PM +0200, Alexander Graf wrote:

On 20.05.14 13:21, Alexander Graf wrote:

On 20.05.14 10:30, Gavin Shan wrote:

The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
to support EEH functionality for PCI devices, which have been
passed from host to guest via VFIO.

Signed-off-by: Gavin Shan 
---
  arch/powerpc/platforms/powernv/Makefile   |   1 +
  arch/powerpc/platforms/powernv/eeh-vfio.c | 445
++
  drivers/vfio/pci/vfio_pci.c   |  24 +-
  drivers/vfio/pci/vfio_pci_private.h   |  16 ++
  include/uapi/linux/vfio.h |  43 +++
  5 files changed, 523 insertions(+), 6 deletions(-)
  create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c

Why doesn't this code live inside the vfio module? If I don't load
the vfio module, I don't need that code to waste memory in my
kernel, no?

Yes, It saves some memory.


So I think from a modeling point of view, you want VFIO code that
calls reasonably generic helpers inside the kernel to deal with
errors.

The "generic helpers" don't have anything to do with VFIO. Everything
that interfaces via ioctls with user space is 100% VFIO code.

The latter should be tristate inside vfio.ko, the former can be =y.


The main reason I put eeh-vfio.c to arch/powerpc/platforms/powernv/ is
the source file needs access data structures (struct pnv_phb) defined
in "pci.h" under that directory.

Then create a good in-kernel framework from that directory and make
use of it from the VFIO code :). But please don't mesh together VFIO,
powernv EEH handling and RTAS.


Yeah. How about this? :-)

- Move eeh-vfio.c to drivers/vfio/pci/
- From eeh-vfio.c, dereference arch/powerpc/kernel/eeh.c::eeh_ops, which
   is arch/powerpc/plaforms/powernv/eeh-powernv.c::powernv_eeh_ops. Call


Hrm, I think it'd be nicer to just export individual functions that do 
thing you want to do from eeh.c.



Alex


   to the corresponding callbacks in "eeh_ops" based on incoming RTAS request.

   The file would be renamed to "vfio_eeh.c" as well after moving to VFIO
   driver directory.

Thanks,
Gavin

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Gavin Shan

On Tue, May 20, 2014 at 02:25:45PM +0200, Alexander Graf wrote:
>
>On 20.05.14 14:21, Gavin Shan wrote:
>>On Tue, May 20, 2014 at 01:44:21PM +0200, Alexander Graf wrote:
>>>On 20.05.14 13:40, Gavin Shan wrote:
On Tue, May 20, 2014 at 01:28:40PM +0200, Alexander Graf wrote:
>On 20.05.14 13:21, Alexander Graf wrote:
>>On 20.05.14 10:30, Gavin Shan wrote:
>>>The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
>>>to support EEH functionality for PCI devices, which have been
>>>passed from host to guest via VFIO.
>>>
>>>Signed-off-by: Gavin Shan 
>>>---
>>>  arch/powerpc/platforms/powernv/Makefile   |   1 +
>>>  arch/powerpc/platforms/powernv/eeh-vfio.c | 445
>>>++
>>>  drivers/vfio/pci/vfio_pci.c   |  24 +-
>>>  drivers/vfio/pci/vfio_pci_private.h   |  16 ++
>>>  include/uapi/linux/vfio.h |  43 +++
>>>  5 files changed, 523 insertions(+), 6 deletions(-)
>>>  create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
>>Why doesn't this code live inside the vfio module? If I don't load
>>the vfio module, I don't need that code to waste memory in my
>>kernel, no?
Yes, It saves some memory.

>So I think from a modeling point of view, you want VFIO code that
>calls reasonably generic helpers inside the kernel to deal with
>errors.
>
>The "generic helpers" don't have anything to do with VFIO. Everything
>that interfaces via ioctls with user space is 100% VFIO code.
>
>The latter should be tristate inside vfio.ko, the former can be =y.
>
The main reason I put eeh-vfio.c to arch/powerpc/platforms/powernv/ is
the source file needs access data structures (struct pnv_phb) defined
in "pci.h" under that directory.
>>>Then create a good in-kernel framework from that directory and make
>>>use of it from the VFIO code :). But please don't mesh together VFIO,
>>>powernv EEH handling and RTAS.
>>>
>>Yeah. How about this? :-)
>>
>>- Move eeh-vfio.c to drivers/vfio/pci/
>>- From eeh-vfio.c, dereference arch/powerpc/kernel/eeh.c::eeh_ops, which
>>   is arch/powerpc/plaforms/powernv/eeh-powernv.c::powernv_eeh_ops. Call
>
>Hrm, I think it'd be nicer to just export individual functions that
>do thing you want to do from eeh.c.
>

Ok. Got it. Thanks for your comments :)

Thanks,
Gavin

>
>Alex
>
>>   to the corresponding callbacks in "eeh_ops" based on incoming RTAS request.
>>
>>   The file would be renamed to "vfio_eeh.c" as well after moving to VFIO
>>   driver directory.
>>
>>Thanks,
>>Gavin
>>
>>--
>>To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
>>the body of a message to majord...@vger.kernel.org
>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>--
>To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
>the body of a message to majord...@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Gavin Shan

On Tue, May 20, 2014 at 02:14:56PM +0200, Alexander Graf wrote:
>
>On 20.05.14 13:56, Gavin Shan wrote:
>>On Tue, May 20, 2014 at 01:25:11PM +0200, Alexander Graf wrote:
>>>On 20.05.14 10:30, Gavin Shan wrote:
If we detects frozen state on PE that has been passed to guest, we
needn't handle it. Instead, we rely on the guest to detect and recover
it. The patch avoid EEH event on the frozen passed PE so that the guest
can have chance to handle that.

Signed-off-by: Gavin Shan 
>>>How does the guest learn about this failure? We'd need to inject an
>>>error into it, no?
>>>
>>When error is existing in HW level, 0xFF's will be turned on reading
>>PCI config space or memory BARs. Guest retrieves the failure state,
>>which is captured by HW automatically, via RTAS call
>>"ibm,read-slot-reset-state2" when seeing 0xFF's on reading PCI config
>>space or memory BARs. If "ibm,read-slot-reset-state2" reports errors in HW,
>>the guest kernel starts to recovery.
>>
>>It can be called as "passive" reporting. There possible has one case that
>>the error can't be reported for ever: No device driver binding to the VFIO
>>PCI device and no access to device's config space and memory BARs. However,
>>it doesn't matter. As we don't use the device, we needn't detect and recover
>>the error at all.
>
>So if the guest is waiting for an interrupt to happen it will wait
>forever? Not really nice.
>

Nope, the error reporting in guest isn't interrupt-driven. It's always
"polling" :-)

>>>I think what you want is an irqfd that the in-kernel eeh code
>>>notifies when it sees a failure. When such an fd exists, the kernel
>>>skips its own error handling.
>>>
>>Yeah, it's a good idea and something for me to improve in phase II. We
>>can discuss for more later.
>
>I think it makes sense to at least walk into that direction
>immediately. The reason I brought it up in the context of this patch
>is that with an irqfd you wouldn't need the passed flag at all.
>

I don't see how it can avoid the "passed" flag. Without the flag, any
PCI config and memory BAR access on host side could trigger EEH recovery
for those PCI devices passed to guest. That's unexpected behaviour. 

For host, we have 2 ways to report errors: interrupt driven and polling.
For the guest, we only have "polling" :-)

>>  For now, what I have in my head is something
>>like this:
>>
>>   [ Host ] -> Error detected -> irqfd (or eventfd) -> QEMU
>>|
>>-(A)-
>>|
>> Send one EEH event to guest kernel
>>|
>> Guest kernel starts the recovery
>>
>>(A): I didn't figure out one convienent way to do the EEH event injection yet.
>
>How does the guest learn about errors in pHyp?
>

It relies on "polling".

Thanks,
Gavin

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Alexander Graf



On 20.05.14 14:45, Gavin Shan wrote:

On Tue, May 20, 2014 at 02:14:56PM +0200, Alexander Graf wrote:

On 20.05.14 13:56, Gavin Shan wrote:

On Tue, May 20, 2014 at 01:25:11PM +0200, Alexander Graf wrote:

On 20.05.14 10:30, Gavin Shan wrote:

If we detects frozen state on PE that has been passed to guest, we
needn't handle it. Instead, we rely on the guest to detect and recover
it. The patch avoid EEH event on the frozen passed PE so that the guest
can have chance to handle that.

Signed-off-by: Gavin Shan 

How does the guest learn about this failure? We'd need to inject an
error into it, no?


When error is existing in HW level, 0xFF's will be turned on reading
PCI config space or memory BARs. Guest retrieves the failure state,
which is captured by HW automatically, via RTAS call
"ibm,read-slot-reset-state2" when seeing 0xFF's on reading PCI config
space or memory BARs. If "ibm,read-slot-reset-state2" reports errors in HW,
the guest kernel starts to recovery.

It can be called as "passive" reporting. There possible has one case that
the error can't be reported for ever: No device driver binding to the VFIO
PCI device and no access to device's config space and memory BARs. However,
it doesn't matter. As we don't use the device, we needn't detect and recover
the error at all.

So if the guest is waiting for an interrupt to happen it will wait
forever? Not really nice.


Nope, the error reporting in guest isn't interrupt-driven. It's always
"polling" :-)


That sucks :).




I think what you want is an irqfd that the in-kernel eeh code
notifies when it sees a failure. When such an fd exists, the kernel
skips its own error handling.


Yeah, it's a good idea and something for me to improve in phase II. We
can discuss for more later.

I think it makes sense to at least walk into that direction
immediately. The reason I brought it up in the context of this patch
is that with an irqfd you wouldn't need the passed flag at all.


I don't see how it can avoid the "passed" flag. Without the flag, any
PCI config and memory BAR access on host side could trigger EEH recovery
for those PCI devices passed to guest. That's unexpected behaviour.


Instead of

  if (passed_flag)
return;

you would do

  if (trigger_irqfd) {
trigger_irqfd();
return;
  }

which would be a much nicer, generic interface.


For host, we have 2 ways to report errors: interrupt driven and polling.
For the guest, we only have "polling" :-)


And the interrupt path is powernv specific? Does sPAPR specify anything 
here?





  For now, what I have in my head is something
like this:

   [ Host ] -> Error detected -> irqfd (or eventfd) -> QEMU
|
-(A)-
|
 Send one EEH event to guest kernel
|
 Guest kernel starts the recovery

(A): I didn't figure out one convienent way to do the EEH event injection yet.

How does the guest learn about errors in pHyp?


It relies on "polling".


Sigh ;).

So how about we just implement this whole thing properly as irqfd? 
Whether QEMU can actually do anything with the interrupt is a different 
question - we can leave it be for now. But we could model all the code 
with the assumption that it should either handle the error itself or 
trigger and irqfd write.



Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Benjamin Herrenschmidt

On Tue, 2014-05-20 at 21:56 +1000, Gavin Shan wrote:

 .../...

> >I think what you want is an irqfd that the in-kernel eeh code
> >notifies when it sees a failure. When such an fd exists, the kernel
> >skips its own error handling.
> >
> 
> Yeah, it's a good idea and something for me to improve in phase II. We
> can discuss for more later. For now, what I have in my head is something
> like this:

However, this would be a deviation from (or extension of) PAPR. At the
moment, the way things work in PAPR is that the guest is responsible for
querying the EEH state when something "looks" like an error (ie, getting
ff's back). This is also how it works in pHyp.

We have an interrupt path in the host when doing "native" EEH, and it
would be nice to extend PAPR to also be able to shoot an event to the
guest possibly using RTAS events, but let's get the basics working and
upstream first.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Benjamin Herrenschmidt

On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote:
> Instead of
> 
>if (passed_flag)
>  return;
> 
> you would do
> 
>if (trigger_irqfd) {
>  trigger_irqfd();
>  return;
>}
> 
> which would be a much nicer, generic interface.

But that's not how PAPR works.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Benjamin Herrenschmidt

On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote:
> So how about we just implement this whole thing properly as irqfd? 
> Whether QEMU can actually do anything with the interrupt is a different 
> question - we can leave it be for now. But we could model all the code 
> with the assumption that it should either handle the error itself or 
> trigger and irqfd write.

I don't object to the idea... however this smells of Deja Vu...

You often tend to want to turn something submitted that fills a specific
gap and implements a specific spec/function into some kind of idealized
grand design :-) And that means nothing gets upstream for weeks or monthes
as we churn and churn...

Sometimes it's probably worth it. Here I would argue against it and would
advocate for doing the basic functionality first, as it is used by guests,
and later add the irqfd option. I don't see any emergency here and adding
the irqfd will not cause fundamental design changes:

The "passed" flag (though I'm not fan of the name) is really something
we want in the low level handlers to avoid triggering host side EEH in
various places, regardless of whether we use irqfd or not.

This is totally orthogonal from the mechanism used for notifications.

Even in host, the detection path doesn't always involve interrupts, and
we can detect some things as a result of a host side config space access
for example etc...

So let's keep things nice and separate here. The interrupt notification
is just an "optimization" which will speed up discovery of the error in
*some* cases later on (but adds its own complexity since we have multiple
discovery path in host, so we need to keep track whether we have notified
yet or not etc...) so let's keep it for later.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Benjamin Herrenschmidt

On Tue, 2014-05-20 at 14:25 +0200, Alexander Graf wrote:
> > - Move eeh-vfio.c to drivers/vfio/pci/
> > - From eeh-vfio.c, dereference arch/powerpc/kernel/eeh.c::eeh_ops,
> which
> >is arch/powerpc/plaforms/powernv/eeh-powernv.c::powernv_eeh_ops.
> Call
> 
> Hrm, I think it'd be nicer to just export individual functions that do
> thing you want to do from eeh.c.

We already have an eeh_ops backend system with callbacks since we have
different backends for RTAS and powernv, so we could do what you suggest
but it would probably just boil down to wrappers around the EEH ops.

No big opinion either way on my side though.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Benjamin Herrenschmidt

On Tue, 2014-05-20 at 22:39 +1000, Gavin Shan wrote:
> >>Yeah. How about this? :-)
> >>
> >>- Move eeh-vfio.c to drivers/vfio/pci/
> >>- From eeh-vfio.c, dereference arch/powerpc/kernel/eeh.c::eeh_ops, which
> >>   is arch/powerpc/plaforms/powernv/eeh-powernv.c::powernv_eeh_ops. Call
> >
> >Hrm, I think it'd be nicer to just export individual functions that
> >do thing you want to do from eeh.c.
> >
> 
> Ok. Got it. Thanks for your comments :)

The interesting thing with this approach is that VFIO per-se can work
with EEH RTAS backend too in the host.

IE, with PR KVM for example or with non-KVM uses of VFIO, it would be
possible to use a device in a user process and exploit EEH even when
running under a PAPR hypervisor.

That is, vfio-eeh uses "generic" exported EEH APIs from the EEH core
that will work on both powernv and RTAS backends.

Note to Alex: This definitely kills the notifier idea for now though,
at least as a first class citizen of the design. We can add it as an
optional optimization on top later.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Gavin Shan

On Wed, May 21, 2014 at 10:23:52AM +1000, Benjamin Herrenschmidt wrote:
>On Tue, 2014-05-20 at 22:39 +1000, Gavin Shan wrote:
>> >>Yeah. How about this? :-)
>> >>
>> >>- Move eeh-vfio.c to drivers/vfio/pci/
>> >>- From eeh-vfio.c, dereference arch/powerpc/kernel/eeh.c::eeh_ops, which
>> >>   is arch/powerpc/plaforms/powernv/eeh-powernv.c::powernv_eeh_ops. Call
>> >
>> >Hrm, I think it'd be nicer to just export individual functions that
>> >do thing you want to do from eeh.c.
>> >
>> 
>> Ok. Got it. Thanks for your comments :)
>
>The interesting thing with this approach is that VFIO per-se can work
>with EEH RTAS backend too in the host.
>

Yeah, it's another benefit for the approach. I need think about it
later. Thanks, Ben. It's really good point.

>IE, with PR KVM for example or with non-KVM uses of VFIO, it would be
>possible to use a device in a user process and exploit EEH even when
>running under a PAPR hypervisor.
>
>That is, vfio-eeh uses "generic" exported EEH APIs from the EEH core
>that will work on both powernv and RTAS backends.
>
>Note to Alex: This definitely kills the notifier idea for now though,
>at least as a first class citizen of the design. We can add it as an
>optional optimization on top later.
>

Yeah, I'll address the notifier later.

Thanks,
Gavin

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Gavin Shan

On Wed, May 21, 2014 at 10:12:11AM +1000, Benjamin Herrenschmidt wrote:
>On Tue, 2014-05-20 at 21:56 +1000, Gavin Shan wrote:
>
> .../...
>
>> >I think what you want is an irqfd that the in-kernel eeh code
>> >notifies when it sees a failure. When such an fd exists, the kernel
>> >skips its own error handling.
>> >
>> 
>> Yeah, it's a good idea and something for me to improve in phase II. We
>> can discuss for more later. For now, what I have in my head is something
>> like this:
>
>However, this would be a deviation from (or extension of) PAPR. At the
>moment, the way things work in PAPR is that the guest is responsible for
>querying the EEH state when something "looks" like an error (ie, getting
>ff's back). This is also how it works in pHyp.
>
>We have an interrupt path in the host when doing "native" EEH, and it
>would be nice to extend PAPR to also be able to shoot an event to the
>guest possibly using RTAS events, but let's get the basics working and
>upstream first.
>

Got it. Thanks, Ben :-)

Thanks,
Gavin

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v5 0/4] EEH Support for VFIO PCI device

2014-05-20 Thread Gavin Shan

The series of patches intends to support EEH for PCI devices, which are
passed through to PowerKVM based guest via VFIO. The implementation is
straightforward based on the issues or problems we have to resolve to
support EEH for PowerKVM based guest.

- Emulation for EEH RTAS requests. All EEH RTAS requests goes to QEMU firstly.
  If QEMU can't handle it, the request will be sent to host via newly introduced
  VFIO container IOCTL command (VFIO_EEH_OP) and gets handled in host kernel.

The series of patches requires corresponding QEMU changes.

Change log
==
v1 -> v2:
* EEH RTAS requests are routed to QEMU, and then possiblly to host 
kerenl.
  The mechanism KVM in-kernel handling is dropped.
* Error injection is reimplemented based syscall, instead of KVM 
in-kerenl
  handling. The logic for error injection token management is moved to
  QEMU. The error injection request is routed to QEMU and then possiblly
  to host kernel.
v2 -> v3:
* Make the fields in struct eeh_vfio_pci_addr, struct vfio_eeh_info 
based
  on the comments from Alexey.
* Define macros for EEH VFIO operations (Alexey).
* Clear frozen state after successful PE reset.
* Merge original [PATCH 1/2/3] to one.
v3 -> v4:
* Remove the error injection from the patchset. Mike or I will work on 
that
  later.
* Rename CONFIG_VFIO_EEH to VFIO_PCI_EEH.
* Rename the IOCTL command to VFIO_EEH_OP and it's handled by VFIO-PCI 
device
  instead of VFIO container.
* Rename the IOCTL argument structure to "vfio_eeh_op" accordingly. 
Also, more
  fields added to hold return values for RTAS requests.
* The address mapping stuff is totally removed. When opening or 
releasing VFIO
  PCI device, notification sent to EEH to update the flags indicates 
the device
  is passed to guest or not.
* Change pr_warn() to pr_debug() to avoid DOS as pointed by Alex.W
* Argument size check issue pointed by Alex.W.
v4 -> v5:
* Functions for VFIO PCI EEH support are moved to eeh.c and exported 
from there.
  VFIO PCI driver just uses those functions to tackle IOCTL command 
VFIO_EEH_OP.
  All of this is to make the code organized in a good way as suggested 
by Alex.G.
  Another potential benefit is PowerNV/pSeries are sharing "eeh_ops" 
and same
  infrastructure could possiblly work for KVM_PR and KVM_HV mode at the 
same time.
* Don't clear error injection registers after finishing PE reset as the 
patchset
  is doing nothing related to error injection.
* Amending Documentation/vfio.txt, which was missed in last revision.
* No QEMU changes for this revision. "v4" works well. Also, remove 
"RFC" from the
  subject as the design is basically recognized.



Gavin Shan (4):
  drivers/vfio: Introduce CONFIG_VFIO_PCI_EEH
  powerpc/eeh: Flags for passed device and PE
  drivers/vfio: EEH support for VFIO PCI device
  powerpc/eeh: Avoid event on passed PE

 Documentation/vfio.txt|   6 +-
 arch/powerpc/include/asm/eeh.h|  42 
 arch/powerpc/kernel/eeh.c | 331 ++
 arch/powerpc/platforms/powernv/eeh-ioda.c |   3 +-
 drivers/vfio/pci/Kconfig  |   6 +
 drivers/vfio/pci/vfio_pci.c   |  99 -
 include/uapi/linux/vfio.h |  43 
 7 files changed, 522 insertions(+), 8 deletions(-)

-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device

2014-05-20 Thread Gavin Shan

The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
to support EEH functionality for PCI devices, which have been
passed from host to guest via VFIO.

Signed-off-by: Gavin Shan 
---
 Documentation/vfio.txt |   6 +-
 arch/powerpc/include/asm/eeh.h |  10 ++
 arch/powerpc/kernel/eeh.c  | 323 +
 drivers/vfio/pci/vfio_pci.c|  99 -
 include/uapi/linux/vfio.h  |  43 ++
 5 files changed, 474 insertions(+), 7 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index b9ca023..bb17ec7 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -305,7 +305,10 @@ faster, the map/unmap handling has been implemented in 
real mode which provides
 an excellent performance which has limitations such as inability to do
 locked pages accounting in real time.
 
-So 3 additional ioctls have been added:
+4) PPC64 guests detect PCI errors and recover from them via EEH RTAS services,
+which works on the basis of additional ioctl command VFIO_EEH_OP.
+
+So 4 additional ioctls have been added:
 
VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
of the DMA window on the PCI bus.
@@ -316,6 +319,7 @@ So 3 additional ioctls have been added:
 
VFIO_IOMMU_DISABLE - disables the container.
 
+   VFIO_EEH_OP - EEH dependent operations
 
 The code flow from the example above should be slightly changed:
 
diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 34a2d83..93922ef 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -305,6 +305,16 @@ void eeh_add_device_late(struct pci_dev *);
 void eeh_add_device_tree_late(struct pci_bus *);
 void eeh_add_sysfs_files(struct pci_bus *);
 void eeh_remove_device(struct pci_dev *);
+#ifdef CONFIG_VFIO_PCI_EEH
+int eeh_vfio_open(struct pci_dev *pdev);
+void eeh_vfio_release(struct pci_dev *pdev);
+int eeh_vfio_set_pe_option(struct pci_dev *pdev, int option, int *retval);
+int eeh_vfio_get_pe_addr(struct pci_dev *pdev, int option,
+int *retval, int *info);
+int eeh_vfio_get_pe_state(struct pci_dev *pdev, int *retval, int *state);
+int eeh_vfio_reset_pe(struct pci_dev *pdev, int option, int *retval);
+int eeh_vfio_configure_pe(struct pci_dev *pdev, int *retval);
+#endif
 
 /**
  * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 9c6b899..2aaf90e 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1098,6 +1098,329 @@ void eeh_remove_device(struct pci_dev *dev)
edev->mode &= ~EEH_DEV_SYSFS;
 }
 
+#ifdef CONFIG_VFIO_PCI_EEH
+int eeh_vfio_open(struct pci_dev *pdev)
+{
+   struct eeh_dev *edev;
+
+   /* No PCI device ? */
+   if (!pdev)
+   return -ENODEV;
+
+   /* No EEH device ? */
+   edev = pci_dev_to_eeh_dev(pdev);
+   if (!edev || !edev->pe)
+   return -ENODEV;
+
+   eeh_dev_set_passed(edev, true);
+   eeh_pe_set_passed(edev->pe, true);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(eeh_vfio_open);
+
+void eeh_vfio_release(struct pci_dev *pdev)
+{
+   bool release_pe = true;
+   struct eeh_pe *pe = NULL;
+   struct eeh_dev *tmp, *edev;
+
+   /* No PCI device ? */
+   if (!pdev)
+   return;
+
+   /* No EEH device ? */
+   edev = pci_dev_to_eeh_dev(pdev);
+   if (!edev || !eeh_dev_passed(edev) ||
+   !edev->pe || !eeh_pe_passed(pe))
+   return;
+
+   /* Release device */
+   pe = edev->pe;
+   eeh_dev_set_passed(edev, false);
+
+   /* Release PE */
+   eeh_pe_for_each_dev(pe, edev, tmp) {
+   if (eeh_dev_passed(edev)) {
+   release_pe = false;
+   break;
+   }
+   }
+
+   if (release_pe)
+   eeh_pe_set_passed(pe, false);
+}
+EXPORT_SYMBOL(eeh_vfio_release);
+
+static int eeh_vfio_check_dev(struct pci_dev *pdev,
+ struct eeh_dev **pedev,
+ struct eeh_pe **ppe)
+{
+   struct eeh_dev *edev;
+
+   /* No device ? */
+   if (!pdev)
+   return -ENODEV;
+
+   edev = pci_dev_to_eeh_dev(pdev);
+   if (!edev || !eeh_dev_passed(edev) ||
+   !edev->pe || !eeh_pe_passed(edev->pe))
+   return -ENODEV;
+
+   if (pedev)
+   *pedev = edev;
+   if (ppe)
+   *ppe = edev->pe;
+
+   return 0;
+}
+
+int eeh_vfio_set_pe_option(struct pci_dev *pdev, int option, int *retval)
+{
+   struct eeh_dev *edev;
+   struct eeh_pe *pe;
+   int ret = 0;
+
+   /* Device existing ? */
+   ret = eeh_vfio_check_dev(pdev, &edev, &pe);
+   if (ret) {
+   pr_debug("%s: Cannot find device %s\n",
+   __func__, pdev ? pci_name(pdev) : "NULL");
+   *retval = -7;
+

[PATCH v5 2/4] powerpc/eeh: Flags for passed device and PE

2014-05-20 Thread Gavin Shan

The patch introduces new flags for EEH device and PE to indicate
that the device or PE has been passed through to guest. In turn,
we will deliver EEH errors to guest for further handling, which
will be done in subsequent patches.

Signed-off-by: Gavin Shan 
---
 arch/powerpc/include/asm/eeh.h | 32 
 1 file changed, 32 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 7782056..34a2d83 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -72,6 +72,7 @@ struct device_node;
 #define EEH_PE_RESET   (1 << 2)/* PE reset in progress */
 
 #define EEH_PE_KEEP(1 << 8)/* Keep PE on hotplug   */
+#define EEH_PE_PASSTHROUGH (1 << 9)/* PE owned by guest*/
 
 struct eeh_pe {
int type;   /* PE type: PHB/Bus/Device  */
@@ -93,6 +94,21 @@ struct eeh_pe {
 #define eeh_pe_for_each_dev(pe, edev, tmp) \
list_for_each_entry_safe(edev, tmp, &pe->edevs, list)
 
+static inline bool eeh_pe_passed(struct eeh_pe *pe)
+{
+   return pe ? !!(pe->state & EEH_PE_PASSTHROUGH) : false;
+}
+
+static inline void eeh_pe_set_passed(struct eeh_pe *pe, bool passed)
+{
+   if (pe) {
+   if (passed)
+   pe->state |= EEH_PE_PASSTHROUGH;
+   else
+   pe->state &= ~EEH_PE_PASSTHROUGH;
+   }
+}
+
 /*
  * The struct is used to trace EEH state for the associated
  * PCI device node or PCI device. In future, it might
@@ -110,6 +126,7 @@ struct eeh_pe {
 #define EEH_DEV_SYSFS  (1 << 9)/* Sysfs created*/
 #define EEH_DEV_REMOVED(1 << 10)   /* Removed permanently  
*/
 #define EEH_DEV_FRESET (1 << 11)   /* Fundamental reset*/
+#define EEH_DEV_PASSTHROUGH(1 << 12)   /* Owned by guest   */
 
 struct eeh_dev {
int mode;   /* EEH mode */
@@ -138,6 +155,21 @@ static inline struct pci_dev *eeh_dev_to_pci_dev(struct 
eeh_dev *edev)
return edev ? edev->pdev : NULL;
 }
 
+static inline bool eeh_dev_passed(struct eeh_dev *dev)
+{
+   return dev ? !!(dev->mode & EEH_DEV_PASSTHROUGH) : false;
+}
+
+static inline void eeh_dev_set_passed(struct eeh_dev *dev, bool passed)
+{
+   if (dev) {
+   if (passed)
+   dev->mode |= EEH_DEV_PASSTHROUGH;
+   else
+   dev->mode &= ~EEH_DEV_PASSTHROUGH;
+   }
+}
+
 /* Return values from eeh_ops::next_error */
 enum {
EEH_NEXT_ERR_NONE = 0,
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v5 1/4] drivers/vfio: Introduce CONFIG_VFIO_PCI_EEH

2014-05-20 Thread Gavin Shan

The patch introduces CONFIG_VFIO_PCI_EEH for more IOCTL commands
on VFIO PCI device support EEH funtionality for PCI devices that
are passed through from host to guest.

Signed-off-by: Gavin Shan 
---
 drivers/vfio/pci/Kconfig | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index c41b01e..dd0e0f0 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,7 @@
 config VFIO_PCI
tristate "VFIO support for PCI devices"
depends on VFIO && PCI && EVENTFD
+   select VFIO_PCI_EEH if PPC_POWERNV
help
  Support for the PCI VFIO bus driver.  This is required to make
  use of PCI drivers using the VFIO framework.
@@ -16,3 +17,8 @@ config VFIO_PCI_VGA
  BIOS and generic video drivers.
 
  If you don't know what to do here, say N.
+
+config VFIO_PCI_EEH
+   tristate
+   depends on EEH && VFIO_IOMMU_SPAPR_TCE
+   default n
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v5 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Gavin Shan

If we detects frozen state on PE that has been passed to guest, we
needn't handle it. Instead, we rely on the guest to detect and recover
it. The patch avoid EEH event on the frozen passed PE so that the guest
can have chance to handle that.

Signed-off-by: Gavin Shan 
---
 arch/powerpc/kernel/eeh.c | 8 
 arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 2aaf90e..25fd12d 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
if (ret > 0)
return ret;
 
+   /*
+* If the PE has been passed to guest, we won't check the
+* state. Instead, let the guest handle it if the PE has
+* been frozen.
+*/
+   if (eeh_pe_passed(pe))
+   return 0;
+
/* If we already have a pending isolation event for this
 * slot, we know it's bad already, we don't need to check.
 * Do this checking under a lock; as multiple PCI devices
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c 
b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 1b5982f..03a3ed2 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -890,7 +890,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
opal_pci_eeh_freeze_clear(phb->opal_id, 
frozen_pe_no,
OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
ret = EEH_NEXT_ERR_NONE;
-   } else if ((*pe)->state & EEH_PE_ISOLATED) {
+   } else if ((*pe)->state & EEH_PE_ISOLATED ||
+  eeh_pe_passed(*pe)) {
ret = EEH_NEXT_ERR_NONE;
} else {
pr_err("EEH: Frozen PHB#%x-PE#%x (%s) 
detected\n",
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Alexander Graf



> Am 21.05.2014 um 02:13 schrieb Benjamin Herrenschmidt 
> :
> 
>> On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote:
>> Instead of
>> 
>>   if (passed_flag)
>> return;
>> 
>> you would do
>> 
>>   if (trigger_irqfd) {
>> trigger_irqfd();
>> return;
>>   }
>> 
>> which would be a much nicer, generic interface.
> 
> But that's not how PAPR works.

But it's what a non-QEMU VFIO user would want, and it should be easy to 
implement.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] powerpc/eeh: Avoid event on passed PE

2014-05-20 Thread Alexander Graf



> Am 21.05.2014 um 02:19 schrieb Benjamin Herrenschmidt 
> :
> 
>> On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote:
>> So how about we just implement this whole thing properly as irqfd? 
>> Whether QEMU can actually do anything with the interrupt is a different 
>> question - we can leave it be for now. But we could model all the code 
>> with the assumption that it should either handle the error itself or 
>> trigger and irqfd write.
> 
> I don't object to the idea... however this smells of Deja Vu...
> 
> You often tend to want to turn something submitted that fills a specific
> gap and implements a specific spec/function into some kind of idealized
> grand design :-) And that means nothing gets upstream for weeks or monthes
> as we churn and churn...
> 
> Sometimes it's probably worth it. Here I would argue against it and would
> advocate for doing the basic functionality first, as it is used by guests,
> and later add the irqfd option. I don't see any emergency here and adding
> the irqfd will not cause fundamental design changes:
> 
> The "passed" flag (though I'm not fan of the name) is really something
> we want in the low level handlers to avoid triggering host side EEH in
> various places, regardless of whether we use irqfd or not.
> 
> This is totally orthogonal from the mechanism used for notifications.
> 
> Even in host, the detection path doesn't always involve interrupts, and
> we can detect some things as a result of a host side config space access
> for example etc...
> 
> So let's keep things nice and separate here. The interrupt notification
> is just an "optimization" which will speed up discovery of the error in
> *some* cases later on (but adds its own complexity since we have multiple
> discovery path in host, so we need to keep track whether we have notified
> yet or not etc...) so let's keep it for later.

EEH handling is your call, but I only see reduced complexity here. I won't nak 
the current approach though.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-20 Thread Alexander Graf



> Am 21.05.2014 um 02:23 schrieb Benjamin Herrenschmidt 
> :
> 
> On Tue, 2014-05-20 at 22:39 +1000, Gavin Shan wrote:
 Yeah. How about this? :-)
 
 - Move eeh-vfio.c to drivers/vfio/pci/
 - From eeh-vfio.c, dereference arch/powerpc/kernel/eeh.c::eeh_ops, which
  is arch/powerpc/plaforms/powernv/eeh-powernv.c::powernv_eeh_ops. Call
>>> 
>>> Hrm, I think it'd be nicer to just export individual functions that
>>> do thing you want to do from eeh.c.
>> 
>> Ok. Got it. Thanks for your comments :)
> 
> The interesting thing with this approach is that VFIO per-se can work
> with EEH RTAS backend too in the host.
> 
> IE, with PR KVM for example or with non-KVM uses of VFIO, it would be
> possible to use a device in a user process and exploit EEH even when
> running under a PAPR hypervisor.
> 
> That is, vfio-eeh uses "generic" exported EEH APIs from the EEH core
> that will work on both powernv and RTAS backends.
> 
> Note to Alex: This definitely kills the notifier idea for now though,
> at least as a first class citizen of the design. We can add it as an
> optional optimization on top later.

I don't think it does. The notifier would just get triggered on config space 
read failures for example :). It's really just an aid for the vfio user to have 
a common code path for error handling.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

37 matches

Mail list logo