Hello,

Here are changes that allow EEH to successfully recover after a failure
that affects both host and guest devices. This happens, for example,
when a PHB containing passed-through devices is fenced. (Failures that
include only passed-through devices are ignored by the host.)
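As a rough illustration of the cases distinguished above, here is a
minimal user-space sketch in C. It is not code from this series: the
names sketch_dev, sketch_scope and classify_failure() are hypothetical
and exist only for this example, and the real kernel works on PEs and
struct eeh_dev rather than a flat array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device in a failed PE (not struct eeh_dev). */
struct sketch_dev {
	const char *name;
	bool passed_through;	/* true if the device is assigned to a guest */
};

/* Hypothetical classification of which side(s) a failure touches. */
enum sketch_scope {
	SCOPE_HOST_ONLY,	/* only host-owned devices are affected */
	SCOPE_GUEST_ONLY,	/* only passed-through devices: host ignores it */
	SCOPE_MIXED,		/* both kinds: the case this series addresses */
};

/*
 * Scan the affected devices and report which side(s) the failure touches.
 * An empty set is reported as SCOPE_HOST_ONLY, an arbitrary choice made
 * for this sketch.
 */
static enum sketch_scope classify_failure(const struct sketch_dev *devs,
					  size_t n)
{
	bool any_host = false;
	bool any_guest = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (devs[i].passed_through)
			any_guest = true;
		else
			any_host = true;
	}

	if (any_host && any_guest)
		return SCOPE_MIXED;
	return any_guest ? SCOPE_GUEST_ONLY : SCOPE_HOST_ONLY;
}
```

A fenced PHB that holds one host-owned device and one passed-through
device would be classified as SCOPE_MIXED in this sketch.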

Currently, when an error affects both passed-through and
non-passed-through devices, the passed-through devices are treated as if
their driver were not EEH aware. This causes them to be hot-unplugged as
part of recovery. The hot-unplug request is forwarded to the guest,
which checks the device status before releasing the device. Because the
host is recovering the device, it reports the device status as
EEH_STATE_UNAVAILABLE, which causes the guest to wait for the device to
become available. This deadlocks the recovery process.

This change causes the host to recover its own devices but leave
passed-through devices frozen until the guest performs its own recovery.
(They are not removed.) If the guest detects the error and begins
recovery itself, it waits for the device state to change away from
EEH_STATE_UNAVAILABLE, that is, until the host has finished its
recovery. The guest's subsequent recovery can then succeed.

Note that resetting a PE may implicitly thaw both it and its child PEs.
To prevent those devices from being accidentally used by the guest
(which may be unaware of the failure and reset) while in this state, we
re-freeze them. This does leave a small window of opportunity, but that
will need to be addressed with a firmware change.

I've also included a fix to the reset function (the last patch), because
without it some scenarios still fail. An example is injecting an error
into a PHB and then exiting a guest that contains passed-through devices
from that PHB, so that an EEH event is raised during the process of
passing the device back to the host.

Cheers,
Sam.

Sam Bobroff (6):
  powerpc/eeh: Cleanup eeh_pe_clear_frozen_state()
  powerpc/eeh: remove sw_state from eeh_unfreeze_pe()
  powerpc/eeh: Add include_passed to eeh_pe_state_clear()
  powerpc/eeh: Add include_passed to eeh_clear_pe_frozen_state()
  powerpc/eeh: Improve recovery of passed-through devices
  powerpc/eeh: Correct retries in eeh_pe_reset_full()

 arch/powerpc/include/asm/eeh.h     |   4 +-
 arch/powerpc/include/asm/ppc-pci.h |   4 +-
 arch/powerpc/kernel/eeh.c          | 103 +++++++++++++++++++----------
 arch/powerpc/kernel/eeh_driver.c   |  86 ++++++++++--------------
 arch/powerpc/kernel/eeh_pe.c       |  68 ++++++++-----------
 arch/powerpc/kernel/eeh_sysfs.c    |   3 +-
 drivers/vfio/vfio_spapr_eeh.c      |   6 +-
 7 files changed, 140 insertions(+), 134 deletions(-)

-- 
2.19.0.2.gcad72f5712