Hello,

Here are changes that allow EEH to successfully recover after a failure that
affects both host and guest devices. This happens, for example, when a PHB
containing passed-through devices is fenced. (Failures that include only
passed-through devices are ignored by the host.)

Currently, when an error affects both passed-through and non-passed-through
devices, the passed-through devices are treated as if their driver were not
EEH-aware. This causes them to be hot-unplugged as part of recovery.

The hot-unplug request is forwarded to the guest, which checks the device
status before releasing the device. Because the host is recovering the device,
it reports the device status as EEH_STATE_UNAVAILABLE, which causes the guest
to wait for the device to become available. This deadlocks the recovery
process.

This change causes the host to recover its own devices but leave
passed-through devices frozen until the guest performs its own recovery. (They
are not removed.) If the guest detects the error and begins recovery itself,
it waits for the device state to change away from EEH_STATE_UNAVAILABLE, that
is, until the host has finished its recovery. The guest's subsequent recovery
can then succeed.
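
As a rough illustration of the policy above (not the code in these patches),
the sketch below models clearing the frozen state while optionally skipping
passed-through PEs. The include_passed name is borrowed from the patch titles;
struct toy_pe and toy_clear_frozen() are hypothetical stand-ins, not kernel
interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a PE; not the kernel's struct eeh_pe. */
struct toy_pe {
	bool passed;	/* true if the PE is passed through to a guest */
	bool frozen;	/* true while the PE is frozen */
};

/*
 * Clear the frozen state of each PE in the array, skipping passed-through
 * PEs unless include_passed is set. Returns the number of PEs thawed.
 */
static size_t toy_clear_frozen(struct toy_pe *pes, size_t n,
			       bool include_passed)
{
	size_t thawed = 0;

	for (size_t i = 0; i < n; i++) {
		if (pes[i].passed && !include_passed)
			continue;	/* left frozen for the guest to recover */
		if (pes[i].frozen) {
			pes[i].frozen = false;
			thawed++;
		}
	}
	return thawed;
}
```

With include_passed false, only host-owned PEs are thawed in this model, which
is the behaviour the host wants during its own recovery.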

Note that resetting a PE may implicitly thaw both it and its child PEs. To
prevent a passed-through device from being accidentally used by the guest
(which may be unaware of the failure and reset) while in this state, we
re-freeze those devices. This does leave a small window of opportunity, but
that will need to be addressed with a firmware change.
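
The re-freeze step can be pictured with a similar toy model. This is only a
sketch under simplifying assumptions: struct sketch_pe, sketch_reset() and
sketch_refreeze_passed() are hypothetical names, and the reset is assumed to
thaw every PE in the array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified PE record; not the kernel's struct eeh_pe. */
struct sketch_pe {
	bool passed;	/* passed through to a guest */
	bool frozen;	/* currently frozen */
};

/* Model a reset that implicitly thaws the parent PE and all child PEs. */
static void sketch_reset(struct sketch_pe *pes, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pes[i].frozen = false;
}

/* Re-freeze passed-through PEs that the reset thawed; returns how many. */
static size_t sketch_refreeze_passed(struct sketch_pe *pes, size_t n)
{
	size_t refrozen = 0;

	for (size_t i = 0; i < n; i++) {
		if (pes[i].passed && !pes[i].frozen) {
			pes[i].frozen = true;
			refrozen++;
		}
	}
	return refrozen;
}
```

Between sketch_reset() and sketch_refreeze_passed() the passed-through PEs are
briefly thawed, which corresponds to the small window mentioned above.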

I've also included a fix to the reset function (the last patch), because
without it some scenarios still fail. An example is injecting an error into
a PHB and then exiting a guest that contains passed-through devices from that
PHB so that an EEH event is raised during the process of passing the device
back to the host.

Cheers,
Sam.

Sam Bobroff (6):
  powerpc/eeh: Cleanup eeh_pe_clear_frozen_state()
  powerpc/eeh: remove sw_state from eeh_unfreeze_pe()
  powerpc/eeh: Add include_passed to eeh_pe_state_clear()
  powerpc/eeh: Add include_passed to eeh_clear_pe_frozen_state()
  powerpc/eeh: Improve recovery of passed-through devices
  powerpc/eeh: Correct retries in eeh_pe_reset_full()

 arch/powerpc/include/asm/eeh.h     |   4 +-
 arch/powerpc/include/asm/ppc-pci.h |   4 +-
 arch/powerpc/kernel/eeh.c          | 103 +++++++++++++++++++----------
 arch/powerpc/kernel/eeh_driver.c   |  86 ++++++++++--------------
 arch/powerpc/kernel/eeh_pe.c       |  68 ++++++++-----------
 arch/powerpc/kernel/eeh_sysfs.c    |   3 +-
 drivers/vfio/vfio_spapr_eeh.c      |   6 +-
 7 files changed, 140 insertions(+), 134 deletions(-)

-- 
2.19.0.2.gcad72f5712
